Message-ID: <20080327203927.GA19968@Krystal>
Date: Thu, 27 Mar 2008 16:39:27 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca>
To: Ingo Molnar <mingo@...e.hu>
Cc: akpm@...ux-foundation.org, linux-kernel@...r.kernel.org,
Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [patch for 2.6.26 0/7] Architecture Independent Markers
* Ingo Molnar (mingo@...e.hu) wrote:
>
> * Mathieu Desnoyers <mathieu.desnoyers@...ymtl.ca> wrote:
>
> > Hi Andrew,
> >
> > After a few RFC rounds, I propose these markers for 2.6.26. They
> > include work done after comments from the memory management community.
> > Most of them have been used by the LTTng project for about 2 years.
>
> very strong NACK. When markers went into 2.6.24 i initially believed
> your claims that my earlier objections about markers have been resolved
> and that it's a lightweight, useful facility.
>
> so we optimistically used markers in ftrace (see sched-devel.git) for
> the scheduler, and i was shocked about marker impact:
>
> just 3 ftrace markers in the scheduler plus their support code bloated
> the kernel by 5k (!), 288 bytes for only 3 markers in the scheduler
> itself, the rest in support code to manage the markers - that's 96 bytes
> added per every marker (44 (!) bytes of that in the fastpath!).
>
> 44 bytes per marker per fastpast is _NOT_ acceptable in any way, shape
> or form. Those 3 limited markers have the same cache cost as adding
> mcount callbacks for dyn-ftrace to the _whole_ scheduler ...
>
> as i told you many, many moons ago, repeatedly: acceptable cost is a 5
> bytes callout that is patched to a NOP, and _maybe_ a little bit more to
> prepare parameters for the function calls. Not 44 bytes. Not 96 bytes.
> Not 5K total cost. Paravirt ops are super-lightweight in comparison.
>
> and this stuff _can_ be done sanely and cheaply and in fact we have done
> it: see ftrace in sched-devel.git, and compare its cost.
>
> see further details in the tongue-in-cheek commit below.
>
Hi Ingo,
Let's compare one marker against one ftrace statement in sched.o on the
sched-dev tree on x86_32 and see where your "bloat" impression about markers
comes from. I think it's mostly due to the different metrics we use.
sched.o w/o CONFIG_CONTEXT_SWITCH_TRACER
text data bss dec hex filename
46564 2924 200 49688 c218 kernel/sched.o
Let's get an idea of the impact of CONFIG_CONTEXT_SWITCH_TRACER on sched.o:
sched.o with CONFIG_CONTEXT_SWITCH_TRACER
text data bss dec hex filename
46788 2924 200 49912 c2f8 kernel/sched.o
224 bytes are added for 6 ftrace_*() calls. This is partly due to the helper
function ftrace_all_fair_tasks(), so let's be fair and leave it out of the
account, and look at the cost of a single ftrace_*() call instead. All the
others are commented out, leaving only this one:
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next)
{
	struct mm_struct *mm, *oldmm;

	prepare_task_switch(rq, prev, next);
	ftrace_ctx_switch(rq, prev, next);
	...
text data bss dec hex filename
46644 2924 200 49768 c268 kernel/sched.o
Commenting this last one out as well:
text data bss dec hex filename
46628 2924 200 49752 c258 kernel/sched.o
That is an extra 16 bytes (13 bytes of code plus alignment), due to this
addition to the schedule() fast path:
movl %ebx, %ecx
movl -48(%ebp), %edx
movl -40(%ebp), %eax
call ftrace_ctx_switch
corresponding to:
38c: 89 d9 mov %ebx,%ecx
38e: 8b 55 d0 mov -0x30(%ebp),%edx
391: 8b 45 d8 mov -0x28(%ebp),%eax
394: e8 fc ff ff ff call 395 <schedule+0x12c>
This adds 13 bytes to the fast path. The stack is read to populate the
registers even when the code is dynamically disabled, and the size of this
code directly depends on the number of parameters passed to the tracer. It
would also have to dereference pointers from memory if some of the data
happened not to be on the stack. All of this runs while tracing is disabled.
I suppose you patch in a NOP over the call to disable it dynamically.
Changing this to a trace_mark():
trace_mark(ctx_switch, "rq %p prev %p next %p", rq, prev, next);
adds this to the schedule() fast path (without the immediate values optimization):
cmpb $0, __mark_ctx_switch.33881+8
jne .L2164
corresponding to:
38c: 80 3d 08 00 00 00 00 cmpb $0x0,0x8
393: 0f 85 0c 03 00 00 jne 6a5 <schedule+0x43c>
(13 bytes in the fast path, including a memory reference)
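At the C level, this corresponds roughly to the following expansion (a sketch
only; the real macro in include/linux/marker.h differs in detail, and the NULL
matches the private-data slot stored as $0x0 in the out-of-line code quoted
further down):

	if (unlikely(__mark_ctx_switch.state))	/* one-byte enable flag */
		(*__mark_ctx_switch.call)(&__mark_ctx_switch, NULL,
					  "rq %p prev %p next %p",
					  rq, prev, next);

Only the flag test stays inline; the argument setup and the indirect call are
moved to a cold jump target.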
With the immediate values optimization, we do better:
mov $0,%al
testb %al, %al
jne .L2164
Corresponding to:
389: b0 00 mov $0x0,%al
38b: 84 c0 test %al,%al
38d: 0f 85 0c 03 00 00 jne 69f <schedule+0x436>
(10 bytes in the fast path instead of 13, and we remove any memory reference)
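The idea behind the immediate values optimization is that the enable flag
lives in the instruction stream itself rather than in memory, and is flipped
by patching the instruction. A minimal sketch of the concept (illustrative
only, not the actual patchset; the real code also records the address of the
immediate in a dedicated section so it can be found and patched at run time):

static inline int marker_enabled(void)
{
	unsigned char cond;

	/*
	 * The $0 immediate below is what gets patched to $1 when the
	 * marker is armed.  Reading it never touches the data cache,
	 * which is where the 10-byte sequence above comes from.
	 */
	asm volatile ("movb $0, %0" : "=q" (cond));
	return cond;
}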
Near the end of schedule(), we find the jump target:
.L2164:
movl %ebx, 20(%esp)
movl -48(%ebp), %edx
movl %edx, 16(%esp)
movl %ecx, 12(%esp)
movl $.LC108, 8(%esp)
movl $0, 4(%esp)
movl $__mark_ctx_switch.33881, (%esp)
call *__mark_ctx_switch.33881+12
jmp .L2126
6a5: 89 5c 24 14 mov %ebx,0x14(%esp)
6a9: 8b 55 d0 mov -0x30(%ebp),%edx
6ac: 89 54 24 10 mov %edx,0x10(%esp)
6b0: 89 4c 24 0c mov %ecx,0xc(%esp)
6b4: c7 44 24 08 f7 04 00 movl $0x4f7,0x8(%esp)
6bb: 00
6bc: c7 44 24 04 00 00 00 movl $0x0,0x4(%esp)
6c3: 00
6c4: c7 04 24 00 00 00 00 movl $0x0,(%esp)
6cb: ff 15 0c 00 00 00 call *0xc
6d1: e9 c3 fc ff ff jmp 399 <schedule+0x130>
Which adds an extra 50 bytes.
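For reference, the +8 and +12 offsets used by the cmpb and by the indirect
call match the layout of the static per-marker structure on x86_32. A
simplified sketch of struct marker from include/linux/marker.h (fields not
relevant here are omitted):

struct marker {
	const char *name;	/* offset  0: "ctx_switch"                 */
	const char *format;	/* offset  4: "rq %p prev %p next %p"      */
	char state;		/* offset  8: the flag tested by the cmpb  */
				/* (3 bytes of padding)                    */
	void (*call)(const struct marker *mdata, void *call_private,
		     const char *fmt, ...);
				/* offset 12: target of the indirect call  */
	/* probe closure fields omitted */
};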
With the immediate values optimization and with the marker, we have a total size of:
text data bss dec hex filename
46767 2956 200 49923 c303 kernel/sched.o
And the baseline, without the marker:
text data bss dec hex filename
46638 2924 200 49762 c262 kernel/sched.o
We add 129 bytes of text here, which does not balance: we would expect to add
only about 60 bytes (10 in the fast path + 50 out of line). I guess some code
alignment is the cause. Let's look at the size of schedule() itself instead,
since this is the only code I touch:
With the immediate values optimization, with the marker:
00000269 <schedule>
...
0000086c <cond_resched_softirq>
1539 bytes
And without the marker:
00000269 <schedule>
...
0000082d <cond_resched_softirq>
1476 bytes
For an added 63 bytes to schedule, which balances modulo some alignment.
If we look at the code surrounding the added 50 bytes (label .L2164) at the
end of schedule(), we see the following assembly:
....
.L2103:
movl -32(%ebp), %eax
testl %eax, %eax
je .L2101
movl $0, 68(%esi)
jmp .L2089
.L2106:
movl $0, -32(%ebp)
.p2align 4,,3
jmp .L2089
.L2164:
movl %ebx, 20(%esp)
movl -48(%ebp), %edx
movl %edx, 16(%esp)
movl %ecx, 12(%esp)
movl $.LC108, 8(%esp)
movl $0, 4(%esp)
movl $__mark_ctx_switch.33909, (%esp)
call *__mark_ctx_switch.33909+12
jmp .L2126
.L2124:
movl -40(%ebp), %eax
call _spin_unlock_irq
.p2align 4,,6
jmp .L2141
.L2161:
movl $1, %ecx
movl $2, %eax
call profile_hits
.p2align 4,,4
jmp .L2069
.L2160:
movl -48(%ebp), %edx
movl 192(%edx), %eax
testl %eax, %eax
jne .L2066
movl %edx, %eax
call __schedule_bug
jmp .L2066
....
These are all targets of "unlikely" branches. The marker's out-of-line code
therefore shares cache lines with these cold targets on architectures with an
associative L1 i-cache. I don't see how this could be considered part of the
fast path.
Therefore, for a 3-argument marker (with immediate values), the marker seems
to outperform ftrace on the following points:
- Adds 10 bytes to the fast path instead of 13.
- No memory read is required on the fast path when the marker is dynamically
disabled.
- The added fast path code size does not depend on the number of parameters
passed.
- The runtime cost, when dynamically disabled, does not depend on the number of
parameters passed.
However, you are right that the _total_ code size of the ftrace statement is
smaller; but since all of it sits in the fast path and is executed even when
dynamically disabled, I don't see this as an overall improvement.
As for the code size required to handle the data afterward: it will be
amortized by a common infrastructure such as LTTng, where the same code
translates the data received as parameters into a trace.
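To make this concrete, here is a minimal sketch of such a common probe
(illustrative only: it follows the va_list-based probe interface of the
recent marker code, earlier versions used a varargs probe, and LTTng
serializes into a trace buffer rather than printing):

#include <linux/module.h>
#include <linux/marker.h>

/*
 * One callback can service every marker it is attached to by
 * interpreting the marker's format string, so the argument-decoding
 * code is written once and shared between instrumentation points.
 */
static void demo_probe(void *probe_private, void *call_private,
		       const char *fmt, va_list *args)
{
	vprintk(fmt, *args);	/* LTTng would write to a trace buffer */
}

static int __init demo_init(void)
{
	/* Attach the probe to the ctx_switch marker used above. */
	return marker_probe_register("ctx_switch",
				     "rq %p prev %p next %p",
				     demo_probe, NULL);
}
module_init(demo_init);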
Regards,
Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68