Message-ID: <6fa9b33c-b661-f0f6-1965-e379c7201172@akamai.com>
Date: Wed, 29 Apr 2020 00:12:03 -0400
From: Jason Baron <jbaron@...mai.com>
To: Roman Penyaev <rpenyaev@...e.de>
Cc: Khazhismel Kumykov <khazhy@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
linux-fsdevel <linux-fsdevel@...r.kernel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Alexander Viro <viro@...iv.linux.org.uk>, Heiher <r@....cc>,
stable@...r.kernel.org
Subject: Re: [PATCH v2] eventpoll: fix missing wakeup for ovflist in
ep_poll_callback
On 4/28/20 2:10 PM, Roman Penyaev wrote:
> On 2020-04-27 22:38, Jason Baron wrote:
>> On 4/25/20 4:59 PM, Khazhismel Kumykov wrote:
>>> On Sat, Apr 25, 2020 at 9:17 AM Jason Baron <jbaron@...mai.com> wrote:
>>>>
>>>>
>>>>
>>>> On 4/24/20 3:00 PM, Khazhismel Kumykov wrote:
>>>>> In the event that we add to ovflist, before 339ddb53d373 we would be
>>>>> woken up by ep_scan_ready_list, and did no wakeup in ep_poll_callback.
>>>>> With that wakeup removed, if we add to ovflist here, we may never wake
>>>>> up. Rather than adding back the ep_scan_ready_list wakeup - which was
>>>>> resulting in unnecessary wakeups, trigger a wake-up in ep_poll_callback.
>>>>
>>>> I'm just curious which 'wakeup' we are talking about here? There is:
>>>> wake_up(&ep->wq) for the 'ep' and then there is the nested one via:
>>>> ep_poll_safewake(ep, epi). It seems to me that it's only about the latter
>>>> one being missing not both? Is your workload using nested epoll?
>>>>
>>>> If so, it might make sense to just do the latter, since the point of
>>>> the original patch was to minimize unnecessary wakeups.
>>>
>>> The missing wake-ups were when we added to ovflist instead of rdllist.
>>> Both are "the ready list" together - so I'd think we'd want the same
>>> wakeups regardless of which specific list we added to.
>>> ep_poll_callback isn't nested specific?
>>>
>>
>> So I was thinking that ep_poll() would see these events on the
>> ovflist without an explicit wakeup, b/c the overflow list being active
>> implies that the ep_poll() path would add them to the rdllist in
>> ep_scan_ready_list(). Thus, it will see the events either in the
>> current ep_poll() context or via a subsequent syscall to epoll_wait().
>>
>> However, there are other paths that can call ep_scan_ready_list() thus
>> I agree with you that both wakeups here are necessary.
>>
>> I do think there are still (smaller) potential races in ep_scan_ready_list()
>> where we have:
>>
>> write_lock_irq(&ep->lock);
>> list_splice_init(&ep->rdllist, &txlist);
>> WRITE_ONCE(ep->ovflist, NULL);
>> write_unlock_irq(&ep->lock);
>>
>> And in the ep_poll path we have:
>>
>> static inline int ep_events_available(struct eventpoll *ep)
>> {
>> return !list_empty_careful(&ep->rdllist) ||
>> READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
>> }
>>
>>
>> Seems to me that first bit needs to be the following, since
>> ep_events_available() is now checked in a lockless way:
>>
>>
>> write_lock_irq(&ep->lock);
>> WRITE_ONCE(ep->ovflist, NULL);
>> smp_wmb();
>> list_splice_init(&ep->rdllist, &txlist);
>> write_unlock_irq(&ep->lock);
>
>
> Hi Jason,
>
> For the first chunk you refer to, the order seems irrelevant.
> Either you see something not empty, you go take the lock
> and then check the lists under the lock, or you see that all
> the lists are empty.
>
Hi Roman,
It does matter. Let's say we have:
epfd1->epfd2->socket. And thread a is doing an
epoll_wait() on epfd1, and thread b is doing an
epoll_wait() on epfd2. Then:
1) data comes in on socket
ep_poll_callback() wakes up threads a and b
2) thread a runs
ep_poll()
ep_scan_ready_list()
ep_send_events_proc()
ep_item_poll()
ep_scan_ready_list()
list_splice_init(&ep->rdllist, &txlist);
3) now thread b is running
ep_poll()
ep_events_available()
returns false
schedule_hrtimeout_range()
Thus, thread b has missed a wakeup and will never
get it.
Similarly, for the second chunk I referenced.
>>
>> And also this bit:
>>
>> WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
>>
>> /*
>> * Quickly re-inject items left on "txlist".
>> */
>> list_splice(&txlist, &ep->rdllist);
>>
>> Should, I think, be reversed as well to:
>>
>> list_splice(&txlist, &ep->rdllist);
>> smp_wmb();
>> WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
>
> But this one is much more interesting. I understand what you
> are trying to achieve: we can't leave both lists empty for the
> short period of time, if there is something left the caller
> of ep_events_available() should be able to see one of the lists
> is not empty, otherwise it can be too late.
>
> But the problem you've spotted is much worse. Some events
> can remain in the txlist (this can happen if the caller of
> epoll_wait wants to harvest only 1 event, but there are more
> in the ->rdllist). And we again get the lost wakeup.
>
> The problem is reproduced by the test below. The idea is simple:
> we have 10 threads and 10 event fds. Each thread can harvest
> only 1 event. 1 producer fires all 10 events at once and waits
> for all 10 events to be observed by the 10 threads.
>
> The fix is basically a revert of 339ddb53d373 with 1 major
> exception: we do wakeups from ep_scan_ready_list(), but only
> if txlist is not empty and ep_scan_ready_list() is called
> from the routine which sends events, not the one which reads
> them (thus we protect ourselves from repeated wakeups).
>
> I will send the code a bit later.
This came up as part of the original discussion around
339ddb53d373: https://lkml.org/lkml/2019/10/7/905
The context was more a performance difference rather than
a semantic difference, but as I pointed out I believe that
behavior pre-dates the above commit and goes back to:
86c0517 fs/epoll: deal with wait_queue only once
There, since the thread is left on the waitqueue across
ep_scan_ready_list(), the wakeup (that was removed
in 339ddb53d373) would no longer wake up other potential
waiters.
So since I think this behavior change goes back to 5.0 and there
really haven't been any reports, I don't think there are
many apps relying on the semantics that your test
case is exercising. It would be interesting to confirm that
your test does indeed succeed/fail before/after that patch.
Also, as part of that original discussion, you had a patch
that I think addresses this. I would be ok with that, in
addition to a patch to address the ordering issue I pointed
out. I can post a patch for the former, if you think this
plan makes sense?
Thanks,
-Jason
>
> --
> Roman
>
> ---- test -------
>
> enum {
> 	EPOLL60_EVENTS_NR = 10,
> };
>
> struct epoll60_ctx {
> 	volatile int stopped;
> 	int ready;
> 	int waiters;
> 	int epfd;
> 	int evfd[EPOLL60_EVENTS_NR];
> };
>
> static inline int count_waiters(struct epoll60_ctx *ctx)
> {
> 	return __atomic_load_n(&ctx->waiters, __ATOMIC_ACQUIRE);
> }
>
> static void *epoll60_wait_thread(void *ctx_)
> {
> 	struct epoll60_ctx *ctx = ctx_;
> 	struct epoll_event e;
> 	uint64_t v;
> 	int ret;
>
> 	while (!ctx->stopped) {
> 		/* Mark we are ready */
> 		__atomic_fetch_add(&ctx->ready, 1, __ATOMIC_ACQUIRE);
>
> 		/* Start when all are ready */
> 		while (__atomic_load_n(&ctx->ready, __ATOMIC_ACQUIRE) &&
> 		       !ctx->stopped);
>
> 		/* Account this waiter */
> 		__atomic_fetch_add(&ctx->waiters, 1, __ATOMIC_ACQUIRE);
> again_wait:
> 		ret = epoll_wait(ctx->epfd, &e, 1, 1000);
> 		if (ret != 1) {
> 			/* Should be stopped, otherwise we lost a wakeup */
> 			assert(ctx->stopped);
> 			return NULL;
> 		}
>
> 		ret = read(e.data.fd, &v, sizeof(v));
> 		if (ret != sizeof(v)) {
> 			/* Event was stolen by another thread */
> 			goto again_wait;
> 		}
> 		__atomic_fetch_sub(&ctx->waiters, 1, __ATOMIC_RELEASE);
> 	}
>
> 	return NULL;
> }
>
> static inline unsigned long long msecs(void)
> {
> 	struct timespec ts;
> 	unsigned long long msecs;
>
> 	clock_gettime(CLOCK_REALTIME, &ts);
> 	msecs = ts.tv_sec * 1000ull;
> 	msecs += ts.tv_nsec / 1000000ull;
>
> 	return msecs;
> }
>
> TEST(epoll60)
> {
> 	struct epoll60_ctx ctx = { 0 };
> 	pthread_t waiters[ARRAY_SIZE(ctx.evfd)];
> 	struct epoll_event e;
> 	int i, n, ret;
>
> 	signal(SIGUSR1, signal_handler);
>
> 	ctx.epfd = epoll_create1(0);
> 	ASSERT_GE(ctx.epfd, 0);
>
> 	/* Create event fds */
> 	for (i = 0; i < ARRAY_SIZE(ctx.evfd); i++) {
> 		ctx.evfd[i] = eventfd(0, EFD_NONBLOCK);
> 		ASSERT_GE(ctx.evfd[i], 0);
>
> 		e.events = EPOLLIN | EPOLLERR;
> 		e.data.fd = ctx.evfd[i];
> 		ASSERT_EQ(epoll_ctl(ctx.epfd, EPOLL_CTL_ADD,
> 				    ctx.evfd[i], &e), 0);
> 	}
>
> 	/* Create waiter threads */
> 	for (i = 0; i < ARRAY_SIZE(waiters); i++)
> 		ASSERT_EQ(pthread_create(&waiters[i], NULL,
> 					 epoll60_wait_thread, &ctx), 0);
>
> 	for (i = 0; i < 300; i++) {
> 		uint64_t v = 1, ms;
>
> 		/* Wait for all to be ready */
> 		while (__atomic_load_n(&ctx.ready, __ATOMIC_ACQUIRE) !=
> 		       ARRAY_SIZE(ctx.evfd))
> 			;
>
> 		/* Steady, go */
> 		__atomic_fetch_sub(&ctx.ready, ARRAY_SIZE(ctx.evfd),
> 				   __ATOMIC_ACQUIRE);
>
> 		/* Wait until all have gone to the kernel */
> 		while (count_waiters(&ctx) != ARRAY_SIZE(ctx.evfd))
> 			;
>
> 		/* 1ms should be enough to schedule out */
> 		usleep(1000);
>
> 		/* Quickly signal all handles at once */
> 		for (n = 0; n < ARRAY_SIZE(ctx.evfd); n++) {
> 			ret = write(ctx.evfd[n], &v, sizeof(v));
> 			ASSERT_EQ(ret, sizeof(v));
> 		}
>
> 		/* Busy loop for up to 3s waiting for all waiters to wake up */
> 		ms = msecs();
> 		while (count_waiters(&ctx) && msecs() < ms + 3000)
> 			;
>
> 		ASSERT_EQ(count_waiters(&ctx), 0);
> 	}
> 	ctx.stopped = 1;
> 	/* Stop waiters */
> 	for (i = 0; i < ARRAY_SIZE(waiters); i++) {
> 		pthread_kill(waiters[i], SIGUSR1);
> 		pthread_join(waiters[i], NULL);
> 	}
>
> 	for (i = 0; i < ARRAY_SIZE(waiters); i++)
> 		close(ctx.evfd[i]);
> 	close(ctx.epfd);
> }
>
>