linux-kernel - Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <54EDEEC2.2040201@akamai.com>
Date:	Wed, 25 Feb 2015 10:48:18 -0500
From:	Jason Baron <jbaron@...mai.com>
To:	Eric Wong <normalperson@...t.net>
CC:	Ingo Molnar <mingo@...nel.org>, peterz@...radead.org,
	mingo@...hat.com, viro@...iv.linux.org.uk,
	akpm@...ux-foundation.org, davidel@...ilserver.org,
	mtk.manpages@...il.com, linux-kernel@...r.kernel.org,
	linux-fsdevel@...r.kernel.org, linux-api@...r.kernel.org,
	Thomas Gleixner <tglx@...utronix.de>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	"luto@...capital.net >> Andy Lutomirski" <luto@...capital.net>
Subject: Re: [PATCH v2 2/2] epoll: introduce EPOLLEXCLUSIVE and EPOLLROUNDROBIN

On 02/21/2015 07:24 PM, Eric Wong wrote:
> Jason Baron <jbaron@...mai.com> wrote:
>> On 02/18/2015 12:51 PM, Ingo Molnar wrote:
>>> * Ingo Molnar <mingo@...nel.org> wrote:
>>>
>>>>> [...] However, I think the userspace API change is less 
>>>>> clear since epoll_wait() doesn't currently have an 
>>>>> 'input' events argument as epoll_ctl() does.
>>>> ... but the change would be a bit clearer and somewhat 
>>>> more flexible: LIFO or FIFO queueing, right?
>>>>
>>>> But having the queueing model as part of the epoll 
>>>> context is a legitimate approach as well.
>>> Btw., there's another optimization that the networking code 
>>> already does when processing incoming packets: waking up a 
>>> thread on the local CPU, where the wakeup is running.
>>>
>>> Doing the same on epoll would have real scalability 
>>> advantages where incoming events are IRQ driven and are 
>>> distributed amongst multiple CPUs.
>>>
>>> Where events are task driven the scheduler will already try 
>>> to pair up waker and wakee so it might not show up in 
>>> measurements that markedly.
>>>
>> Right, so this makes me think that we may want to potentially
>> support a variety of wakeup policies. Adding these to the
>> generic wake up code is just going to be too messy. So, perhaps
>> a better approach here would be to register a single
>> wait_queue_t with the event source queue that will always
>> be woken up, and then layer any epoll balancing/irq affinity
>> policies on top of that. So in essence we end up with sort of
>> two queues layers, but I think it provides much nicer isolation
>> between layers. Also, the bulk of the changes are going to be
>> isolated to the epoll code, and we avoid Andy's concern about
>> missing, or starving out wakeups.
>>
>> So here's a stab at how this API could look:
>>
>> 1. ep1 = epoll_create1(EPOLL_POLICY);
>>
>> So EPOLL_POLICY here could the round robin policy described
>> here, or the irq affinity or other ideas. The idea is to create
>> an fd that is local to the process, such that other processes
>> can not subsequently attach to it and affect our policy.
> I'm not against defining more policies if needed.
> Maybe FIFO vs LIFO is a good case for this.
>
> For affinity, it could probably be done transparently based on
> epoll_wait retrievals + EPOLL_CTL_MOD operations.
>
>> 2. epoll_ctl(ep1, EPOLL_CTL_ADD, fd_source, NULL);
>>
>> This associates ep1 with the event source. ep1 can be
>> associated with or added to at most 1 wakeup source. This call
>> would largely just form the association, but not queue anything
>> to the fd_source wait queue.
> This would mean one extra FD for every fd_source, but that's
> only a handful of FDs (listen sockets), correct?

Yes, one extra epoll fd per shared wakeup source, so this should
result in very few additional fds.

>> 3. epoll_ctl(ep2, EPOLL_CTL_ADD, ep1, event);
>>     epoll_ctl(ep3, EPOLL_CTL_ADD, ep1, event);
>>     epoll_ctl(ep4, EPOLL_CTL_ADD, ep1, event);
>>      .
>>      .
>>      .
>>
>> Finally, we add the epoll sets to the event source (indirectly via
>> ep1). So the first add would actually queue the callback to the
>> fd_source. While the subsequent calls would simply queue things
>> to the 'nested' wakeup queue associated with ep1.
> I'm not sure I follow, wouldn't this increase the number of wakeups?

I agree, my text there is confusing...I've posted this idea as
v3 of this series, so hopefully that clarifies this approach.

Thanks,

-Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/