Message-ID: <e552262b-2069-075e-f7db-cec19a12a363@kernel.dk>
Date: Fri, 31 May 2019 10:54:27 -0600
From: Jens Axboe <axboe@...nel.dk>
To: Roman Penyaev <rpenyaev@...e.de>
Cc: Azat Khuzhin <azat@...event.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Al Viro <viro@...iv.linux.org.uk>,
Linus Torvalds <torvalds@...ux-foundation.org>,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 00/13] epoll: support pollable epoll from userspace

On 5/31/19 10:02 AM, Roman Penyaev wrote:
> On 2019-05-31 16:48, Jens Axboe wrote:
>> On 5/16/19 2:57 AM, Roman Penyaev wrote:
>>> Hi all,
>>>
>>> This is v3 which introduces pollable epoll from userspace.
>>>
>>> v3:
>>> - Measurements made, represented below.
>>>
>>> - Fix alignment for epoll_uitem structure on all 64-bit archs except
>>>   x86-64. epoll_uitem should always be 16 bytes, proper BUILD_BUG_ON
>>>   is added. (Linus)
>>>
>>> - Check pollflags explicitly for 0 inside work callback, and do nothing
>>>   if 0.
>>>
>>> v2:
>>> - No reallocations, the max number of items (thus size of the user ring)
>>>   is specified by the caller.
>>>
>>> - Interface is simplified: -ENOSPC is returned on attempt to add a new
>>>   epoll item if the number has reached the max, nothing more.
>>>
>>> - Alloced pages are accounted using user->locked_vm and limited to
>>> RLIMIT_MEMLOCK value.
>>>
>>> - EPOLLONESHOT is handled.
>>>
>>> This series introduces pollable epoll from userspace, i.e. the user creates
>>> an epfd with the new EPOLL_USERPOLL flag, mmaps the epoll descriptor, gets
>>> the header and ring pointers and then consumes ready events from the ring,
>>> avoiding the epoll_wait() call. When the ring is empty, the user has to call
>>> epoll_wait() in order to wait for new events. epoll_wait() returns -ESTALE
>>> if the user ring still has events in it (an indication that the user has to
>>> consume events from the user ring first; I could not invent anything better
>>> than returning -ESTALE).
>>>
>>> For user header and user ring allocation I used vmalloc_user(). I found
>>> that it is much easier to reuse remap_vmalloc_range_partial() instead of
>>> dealing with page cache (like aio.c does). What is also nice is that the
>>> virtual address is properly aligned on SHMLBA, thus there should not be
>>> any d-cache aliasing problems on archs with vivt or vipt caches.
>>
>> Why aren't we just adding support to io_uring for this instead? Then we
>> don't need yet another entirely new ring that is just a little
>> different from what we have.
>>
>> I haven't looked into the details of your implementation, just curious
>> if there's anything that makes using io_uring a non-starter for this
>> purpose?
>
> Afaict the main difference is that you do not need to recharge an fd
> (submit a new poll request in terms of io_uring): once an fd has been added
> to epoll with epoll_ctl(), we get events. When you have thousands of fds,
> that should matter.
>
> Another interesting question is how difficult it is to modify existing event
> loops in event libraries in order to support recharging (EPOLLONESHOT in
> terms of epoll).
>
> Maybe Azat, who maintains libevent, can shed light on this (currently I see
> that libevent does not support "EPOLLONESHOT" logic).

In terms of existing io_uring poll support, which is what I'm guessing
you're referring to, it is indeed just one-shot. But there's no reason
why we can't have it persist until explicitly canceled with POLL_REMOVE.
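
To make the "recharging" cost concrete in plain epoll terms: a normal
registration keeps reporting readiness, whereas an EPOLLONESHOT
registration goes quiet after one event until it is re-armed with
EPOLL_CTL_MOD. What follows is only a minimal sketch, assuming a pipe as
the monitored fd. The function count_events() and its parameters are made
up for illustration; they are not part of the patch series or of io_uring.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `rounds` bytes into a pipe, one per round,
 * and count how many readiness events epoll reports for the read end.
 *
 * oneshot != 0: register with EPOLLONESHOT (one event, then disabled).
 * rearm   != 0: re-arm the fd with EPOLL_CTL_MOD after each event.
 *
 * Returns the number of events seen, or -1 on error. Keep `rounds`
 * small: undelivered bytes simply stay in the pipe buffer.
 */
int count_events(int rounds, int oneshot, int rearm)
{
    int fds[2], epfd, seen = 0;
    struct epoll_event ev = { 0 };

    if (pipe(fds) < 0)
        return -1;

    epfd = epoll_create1(0);
    if (epfd < 0) {
        seen = -1;
        goto out_pipe;
    }

    ev.events = EPOLLIN | (oneshot ? EPOLLONESHOT : 0);
    ev.data.fd = fds[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) < 0) {
        seen = -1;
        goto out;
    }

    for (int i = 0; i < rounds; i++) {
        struct epoll_event got;
        char c = 'x';

        if (write(fds[1], &c, 1) != 1) {
            seen = -1;
            break;
        }

        /* Zero timeout: only check for a pending event, never block. */
        if (epoll_wait(epfd, &got, 1, 0) != 1)
            continue;   /* fd is disabled (one-shot already fired) */

        assert(got.data.fd == fds[0]);
        seen++;

        /* Consume the byte so the next round starts from "not ready". */
        if (read(fds[0], &c, 1) != 1) {
            seen = -1;
            break;
        }

        /* The "recharge" step: one extra syscall per event per fd. */
        if (oneshot && rearm &&
            epoll_ctl(epfd, EPOLL_CTL_MOD, fds[0], &ev) < 0) {
            seen = -1;
            break;
        }
    }

out:
    close(epfd);
out_pipe:
    close(fds[0]);
    close(fds[1]);
    return seen;
}
```

With three rounds, the persistent registration and the re-armed one-shot
registration both see three events, while a one-shot registration that is
never re-armed sees only the first. The extra epoll_ctl() per event is
the per-fd overhead that a persistent poll request would avoid.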
--
Jens Axboe