Date:	Tue, 12 May 2009 15:01:12 +0200
From:	Frederic Weisbecker <fweisbec@...il.com>
To:	Wu Fengguang <fengguang.wu@...el.com>
Cc:	Ingo Molnar <mingo@...e.hu>, Li Zefan <lizf@...fujitsu.com>,
	Tom Zanussi <tzanussi@...il.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Pekka Enberg <penberg@...helsinki.fi>,
	Andi Kleen <andi@...stfloor.org>,
	Steven Rostedt <rostedt@...dmis.org>,
	Larry Woodman <lwoodman@...hat.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Eduard - Gabriel Munteanu <eduard.munteanu@...ux360.ro>,
	Andrew Morton <akpm@...ux-foundation.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Matt Mackall <mpm@...enic.com>,
	Alexey Dobriyan <adobriyan@...il.com>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: Re: [rfc] object collection tracing (was: [PATCH 5/5] proc: export
	more page flags in /proc/kpageflags)

On Tue, Apr 28, 2009 at 09:31:08PM +0800, Wu Fengguang wrote:
> On Tue, Apr 28, 2009 at 08:17:51PM +0800, Ingo Molnar wrote:
> > 
> > 
> > * Wu Fengguang <fengguang.wu@...el.com> wrote:
> > 
> > > > The above 'get object state' interface (which allows passive 
> > > > sampling) - integrated into the tracing framework - would serve 
> > > > that goal, agreed?
> > > 
> > > > Agreed. That could in theory be a good complement to dynamic 
> > > > tracing.
> > > 
> > > Then what will be the canonical form for all the 'get object 
> > > state' interfaces - "object.attr=value", or whatever? [...]
> > 
> > Lemme outline what i'm thinking of.
> > 
> > I'd call the feature "object collection tracing", which would live 
> > in /debug/tracing, accessed via such files:
> > 
> >   /debug/tracing/objects/mm/pages/
> >   /debug/tracing/objects/mm/pages/format
> >   /debug/tracing/objects/mm/pages/filter
> >   /debug/tracing/objects/mm/pages/trace_pipe
> >   /debug/tracing/objects/mm/pages/stats
> >   /debug/tracing/objects/mm/pages/events/
> > 
> > here's the (proposed) semantics of those files:
> > 
> > 1) /debug/tracing/objects/mm/pages/
> > 
> > There's a subsystem / object basic directory structure to make it 
> > easy and intuitive to find our way around there.
> > 
> > 2) /debug/tracing/objects/mm/pages/format
> > 
> > the format file:
> > 
> >   /debug/tracing/objects/mm/pages/format
> > 
> > Would reuse the existing dynamic-tracepoint structured-logging 
> > descriptor format and code (this is upstream already):
> > 
> >  [root@...enix sched_signal_send]# pwd
> >  /debug/tracing/events/sched/sched_signal_send
> > 
> >  [root@...enix sched_signal_send]# cat format 
> >  name: sched_signal_send
> >  ID: 24
> >  format:
> > 	field:unsigned short common_type;		offset:0;	size:2;
> > 	field:unsigned char common_flags;		offset:2;	size:1;
> > 	field:unsigned char common_preempt_count;	offset:3;	size:1;
> > 	field:int common_pid;				offset:4;	size:4;
> > 	field:int common_tgid;				offset:8;	size:4;
> > 
> > 	field:int sig;					offset:12;	size:4;
> > 	field:char comm[TASK_COMM_LEN];			offset:16;	size:16;
> > 	field:pid_t pid;				offset:32;	size:4;
> > 
> >  print fmt: "sig: %d  task %s:%d", REC->sig, REC->comm, REC->pid
> > 
> > These format descriptors enumerate fields, types and sizes, in a 
> > structured way that user-space tools can parse easily. (The binary 
> > records that come from the trace_pipe file follow this format 
> > description.)
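To show how little a user-space tool needs in order to consume such a
descriptor, here is a minimal Python sketch. The regular expression and the
parse_format() helper are my own illustration, not an existing tool, and the
sample text is abbreviated from the sched_signal_send output quoted above.

```python
import re

# Matches one descriptor line such as:
#   field:int common_pid;    offset:4;    size:4;
FIELD_RE = re.compile(
    r"field:(?P<decl>[^;]+);\s*offset:(?P<offset>\d+);\s*size:(?P<size>\d+);"
)


def parse_format(text):
    """Return a list of (declaration, offset, size) tuples from a 'format' file."""
    fields = []
    for line in text.splitlines():
        m = FIELD_RE.search(line)
        if m:
            fields.append(
                (m.group("decl").strip(), int(m.group("offset")), int(m.group("size")))
            )
    return fields


# Abbreviated copy of the descriptor shown above (illustration only).
SAMPLE = """\
name: sched_signal_send
ID: 24
format:
    field:unsigned short common_type;    offset:0;    size:2;

    field:int sig;    offset:12;    size:4;
    field:char comm[TASK_COMM_LEN];    offset:16;    size:16;
    field:pid_t pid;    offset:32;    size:4;

print fmt: "sig: %d  task %s:%d", REC->sig, REC->comm, REC->pid
"""

if __name__ == "__main__":
    for decl, offset, size in parse_format(SAMPLE):
        print(decl, offset, size)
```

With the offsets and sizes in hand, a tool could slice the binary records
read from trace_pipe without knowing anything about the object itself.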
> > 
> > 3) /debug/tracing/objects/mm/pages/filter
> > 
> > This is the tracing filter that can be set based on the 'format' 
> > descriptor. So with the above (signal-send tracepoint) you can 
> > define such filter expressions:
> > 
> >   echo "(sig == 10 && comm == bash) || sig == 13" > filter
> > 
> > To restrict the 'scope' of the object collection along pretty much 
> > any key or combination of keys. (Or you can leave it as it is and 
> > dump all objects and do keying in user-space.)
> > 
> > [ Using in-kernel filtering is obviously faster than streaming it 
> >   out to user-space - but there might be details and types of 
> >   visualization you want to do in user-space - so we don't want to 
> >   restrict things here. ]
> > 
> > For the mm object collection tracepoint i could imagine such filter 
> > expressions:
> > 
> >   echo "type == shared && file == /sbin/init" > filter
> > 
> > To dump all shared pages that are mapped to /sbin/init.
> > 
> > 4) /debug/tracing/objects/mm/pages/trace_pipe
> > 
> > The 'trace_pipe' file can be used to dump all objects in the 
> > collection, which match the filter ('all objects' by default). The 
> > record format is described in 'format'.
> > 
> > trace_pipe would be a reuse of the existing trace_pipe code: it is a 
> > modern, poll()-able, read()-able, splice()-able pipe abstraction.
> > 
> > 5) /debug/tracing/objects/mm/pages/stats
> > 
> > The 'stats' file would be a reuse of the existing histogram code of 
> > the tracing code. We already make use of it for the branch tracers 
> > and for the workqueue tracer - it could be extended to be applicable 
> > to object collections as well.
> > 
> > The advantage there would be that there's no dumping at all - all 
> > the integration is done straight in the kernel. ( The 'filter' 
> > condition is honored - increasing flexibility. The filter file 
> > could perhaps also act as a default histogram key. )
> > 
> > 6) /debug/tracing/objects/mm/pages/events/
> > 
> > The 'events' directory offers links back to existing dynamic 
> > tracepoints that are under /debug/tracing/events/. This would serve 
> > as an additional coherent force that keeps dynamic tracepoints 
> > collected by subsystem and by object type as well. (Tools could make 
> > use of this information as well - without being aware of actual 
> > object semantics.)
> > 
> > 
> > There would be a number of other object collections we could 
> > enumerate:
> > 
> >  tasks:
> > 
> >   /debug/tracing/objects/sched/tasks/
> > 
> >  active inodes known to the kernel:
> > 
> >   /debug/tracing/objects/fs/inodes/
> > 
> >  interrupts:
> > 
> >   /debug/tracing/objects/hw/irqs/
> > 
> > etc.
> > 
> > These would use the same 'object collection' framework. Once done we 
> > can use it for many other things too.
> > 
> > Note how organically integrated it all is with the tracing 
> > framework. You could start from an 'object view' to get an overview 
> > and then go towards a more dynamic view of specific object 
> > attributes (or specific objects), as you drill down on a specific 
> > problem you want to analyze.
> > 
> > How does this all sound to you?
> 
> Great! I see a good opportunity to adapt the not-yet-submitted
> /proc/filecache interface to the proposed framework.
> 
> Its basic form is:
> 
> #      ino       size   cached cached% refcnt state       age accessed  process         dev             file
> [snip]
>        320          1        4     100      1    D-     50443     1085 udevd           00:11(tmpfs)     /.udev/uevent_seqnum
>     460725        123      124     100     35    --     50444     6795 touch           08:02(sda2)      /lib/libpthread-2.9.so
>     460727         31       32     100     14    --     50444     2007 touch           08:02(sda2)      /lib/librt-2.9.so
>     458865         97       80      82      1    --     50444       49 mount           08:02(sda2)      /lib/libdevmapper.so.1.02.1
>     460090         15       16     100      1    --     50444       48 mount           08:02(sda2)      /lib/libuuid.so.1.2
>     458866         46       48     100      1    --     50444       47 mount           08:02(sda2)      /lib/libblkid.so.1.0
>     460732         43       44     100     69    --     50444     3581 rcS             08:02(sda2)      /lib/libnss_nis-2.9.so
>     460739         87       88     100     73    --     50444     3597 rcS             08:02(sda2)      /lib/libnsl-2.9.so
>     460726         31       32     100     69    --     50444     3581 rcS             08:02(sda2)      /lib/libnss_compat-2.9.so
>     458804        250      252     100     11    --     50445     8175 rcS             08:02(sda2)      /lib/libncurses.so.5.6
>     229540        780      752      96      3    --     50445     7594 init            08:02(sda2)      /bin/bash
>     460735         15       16     100     89    --     50445    17581 init            08:02(sda2)      /lib/libdl-2.9.so
>     460721       1344     1340      99    117    --     50445    48732 init            08:02(sda2)      /lib/libc-2.9.so
>     458801        107      104      97     24    --     50445     3586 init            08:02(sda2)      /lib/libselinux.so.1
>     671870         37       24      65      1    --     50446        1 swapper         08:02(sda2)      /sbin/init
>        175          1    24412     100      1    --     50446        0 swapper         00:01(rootfs)    /dev/root
> 
> The patch basically does a traversal through one or more of the inode
> lists to produce the output:
>         inode_in_use
>         inode_unused
>         sb->s_dirty
>         sb->s_io
>         sb->s_more_io
>         sb->s_inodes
> 
> The filtering feature is a necessity for this interface - or it will
> take considerable time to do a full listing. It supports the following
> filters:
>         { LS_OPT_DIRTY,         "dirty"         },
>         { LS_OPT_CLEAN,         "clean"         },
>         { LS_OPT_INUSE,         "inuse"         },
>         { LS_OPT_EMPTY,         "empty"         },
>         { LS_OPT_ALL,           "all"           },
>         { LS_OPT_DEV,           "dev=%s"        },
> 
> There are two possible challenges for the conversion:
> 
> - One trick it does is to select different lists to traverse on
>   different filter options. Will this be possible in the object
>   tracing framework?



Yeah, I guess.
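Roughly, the collection would pick its iteration source from the filter
before walking anything. A small user-space Python model of that idea
follows. The list names are the ones from your mail, but the mapping from
option to lists is only my assumption for illustration, and collect() is a
hypothetical helper, not an existing interface.

```python
# Hypothetical mapping from a filter option to the inode lists worth walking.
# Which option maps to which list is an assumption made for illustration.
LISTS_FOR_OPTION = {
    "dirty": ["sb->s_dirty", "sb->s_io", "sb->s_more_io"],
    "inuse": ["inode_in_use"],
    "all": ["sb->s_inodes"],
}


def collect(lists, option, match):
    """Walk only the lists selected by 'option' and yield matching objects.

    'lists' maps a list name to an iterable of objects; unknown options
    fall back to the full sb->s_inodes walk.
    """
    names = LISTS_FOR_OPTION.get(option, ["sb->s_inodes"])
    for name in names:
        for obj in lists.get(name, []):
            if match(obj):
                yield obj
```

The point is only that the choice of list happens once, up front, so a
"dirty" query never touches the clean inodes at all.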



> - The file name lookup (last field) is the performance killer. Is it
>   possible to skip the file name lookup when the filter fails on the
>   leading fields?


Object collection relies on trace events, where a filter basically drops
a whole entry when it doesn't match. I'm not sure we can easily skip
just one field.

But I guess we can do something about the performance...
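One option would be to run the filter on the cheap fields first and resolve
the file name only for entries that survive. Here is a user-space Python
model of that ordering. It assumes the filter does not reference the file
name itself, and dump(), cheap_filter and lookup_name are hypothetical names
used only for this sketch.

```python
def dump(records, cheap_filter, lookup_name):
    """Yield (record, name) pairs, resolving the name only after the filter passes.

    'cheap_filter' sees only the leading fields; 'lookup_name' stands in
    for the costly file name lookup.
    """
    for rec in records:
        if not cheap_filter(rec):
            continue  # whole entry dropped before the costly lookup runs
        yield rec, lookup_name(rec)
```

This keeps the "ignore the whole entry" semantics of the event filters,
it just moves the expensive field to after the decision.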

Could you send us the (sob'ed) patch you made which implements this?
I could try to adapt it to object collection.

Thanks,
Frederic.


> Will the object tracing interface allow such flexibility?
> (Sorry I'm not yet familiar with the tracing framework.)
> 
> > Can you see any conceptual holes in the scheme, any use-case that 
> > /proc/kpageflags supports but the object collection approach does 
> > not?
> 
> kpageflags is simply a big (perhaps sparse) binary array.
> I'd still prefer to retain its current form - the kernel patches and
> user space tools are all ready made, and I see no benefits in
> converting to the tracing framework.
> 
> > Would you be interested in seeing something like this, if we tried 
> > to implement it in the tracing tree? The majority of the code 
> > already exists, we just need interest from the MM side and we have 
> > to hook it all up. (it is by no means trivial to do - but looks like
> > a very exciting feature.)
> 
> Definitely! /proc/filecache has another 'page view':
> 
>         # head /proc/filecache
>         # file /bin/bash
>         # flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback X:readahead P:private O:owner b:buffer d:dirty w:writeback
>         # idx   len     state           refcnt
>         0       1       RAMU________    4
>         3       8       RAMU________    4
>         12      1       RAMU________    4
>         14      5       RAMU________    4
>         20      7       RAMU________    4
>         27      2       RAMU________    5
>         29      1       RAMU________    4
> 
> Which is also a good candidate. However I still need to investigate
> whether it offers a considerable advantage over the mincore() syscall.
> 
> Thanks and Regards,
> Fengguang
