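Source code based on: Linux 5.10
Conventions:
- Chip architecture: ARM64
- Memory architecture: UMA
- CONFIG_ARM64_VA_BITS: 39
- CONFIG_ARM64_PAGE_SHIFT: 12
- CONFIG_PGTABLE_LEVELS: 3
1. Usage
The kernel must be built with the following options enabled: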
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASKSTATS=y
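Delay accounting is enabled by default at boot. To disable it, add nodelayacct to the kernel command line; up to and including 5.10, disabling is done through this parameter.
Newer kernels can also switch it on and off through the kernel.task_delayacct sysctl.
With CONFIG_TASK_DELAY_ACCT enabled, struct task_struct gains one extra member: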
struct task_struct {
    ...
#ifdef CONFIG_TASK_DELAY_ACCT
    struct task_delay_info *delays;
#endif
    ...
};
Once the system is up, the getdelays command can be used to query the task with a given pid or tgid:
shift:/ # getdelays -d -p 25348 -v
print delayacct stats ON
debug on
family id 26
Sent pid/tgid, retval 0
received 380 bytes
nlmsghdr size=16, nlmsg_len=380, rep_len=380
PID 25348
CPU                 count     real total  virtual total    delay total  delay average
                     1126      779761825      752721329      944603556        0.839ms
IO                  count    delay total  delay average
                      201      660349228            3ms
SWAP                count    delay total  delay average
                        0              0            0ms
RECLAIM             count    delay total  delay average
                       10       56215518            5ms
THRASHING           count    delay total  delay average
                       57      326115261            5ms
The getdelays command breaks delayacct down into several dimensions:
- CPU
- IO
- SWAP
- RECLAIM
- THRASHING
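The "delay average" column in the sample output follows from the columns before it. Below is a minimal C sketch of that arithmetic, assuming the accumulated nanoseconds are first converted to milliseconds and then divided by the count, with integer division for the IO, SWAP, RECLAIM and THRASHING rows and floating point for the CPU row. The helper names avg_ms_int() and avg_ms_cpu() are hypothetical and are not taken from getdelays itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: average delay in whole milliseconds.
 * total_ns is the accumulated delay in nanoseconds, count the number
 * of events. A zero count is treated as 1 to avoid dividing by zero. */
static uint64_t avg_ms_int(uint64_t total_ns, uint64_t count)
{
    return total_ns / 1000000ULL / (count ? count : 1);
}

/* Hypothetical helper: the same average kept as a fraction,
 * which is how the CPU row is assumed to be computed. */
static double avg_ms_cpu(uint64_t total_ns, uint64_t count)
{
    return (double)total_ns / 1000000.0 / (double)(count ? count : 1);
}
```

With the numbers from the sample output, avg_ms_int(660349228, 201) gives 3 and avg_ms_int(56215518, 10) gives 5, matching the IO and RECLAIM rows, while avg_ms_cpu(944603556, 1126) is roughly 0.839.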
2. struct task_delay_info
include/linux/delayacct.h
#ifdef CONFIG_TASK_DELAY_ACCT
struct task_delay_info {
    raw_spinlock_t lock;
    unsigned int flags;     /* Private per-task flags */

    /* For each stat XXX, add following, aligned appropriately
     *
     * struct timespec XXX_start, XXX_end;
     * u64 XXX_delay;
     * u32 XXX_count;
     *
     * Atomicity of updates to XXX_delay, XXX_count protected by
     * single lock above (split into XXX_lock if contention is an issue).
     */

    /*
     * XXX_count is incremented on every XXX operation, the delay
     * associated with the operation is added to XXX_delay.
     * XXX_delay contains the accumulated delay time in nanoseconds.
     */
    u64 blkio_start;        /* Shared by blkio, swapin */
    u64 blkio_delay;        /* wait for sync block io completion */
    u64 swapin_delay;       /* wait for swapin block io completion */
    u32 blkio_count;        /* total count of the number of sync block */
                            /* io operations performed */
    u32 swapin_count;       /* total count of the number of swapin block */
                            /* io operations performed */

    u64 freepages_start;
    u64 freepages_delay;    /* wait for memory reclaim */

    u64 thrashing_start;
    u64 thrashing_delay;    /* wait for thrashing page */

    u32 freepages_count;    /* total count of memory reclaim */
    u32 thrashing_count;    /* total count of thrash waits */
};
#endif
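- lock: spinlock that protects the structure;
- flags: records why the task is blocked. DELAYACCT_PF_SWAPIN means the task is in swapin, DELAYACCT_PF_BLKIO means it is waiting for IO;
- blkio_start: shared by swapin and IO; the timestamp at which the wait began, in ns;
- blkio_delay: total time the task has been blocked waiting for IO, in ns; see __schedule() and try_to_wake_up();
- swapin_delay: total time spent in swapin, in ns; see do_swap_page();
- blkio_count: number of times the task has blocked waiting for IO;
- swapin_count: number of swapins;
- freepages_start: timestamp at which memory reclaim began, in ns; see do_try_to_free_pages();
- freepages_delay: total time spent in memory reclaim, in ns; see do_try_to_free_pages();
- freepages_count: number of memory reclaims;
- thrashing_start: timestamp at which a thrashing wait began, in ns; see wait_on_page_bit_common();
- thrashing_delay: total time spent in thrashing waits, in ns; see wait_on_page_bit_common();
- thrashing_count: number of thrashing waits;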
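Every start/delay/count triplet in the structure follows the same pattern: record a timestamp when the wait begins, then on completion add the elapsed time to the accumulated delay and bump the counter. The following is a user-space sketch of that pattern, not the kernel code. struct stat_model, stat_start() and stat_end() are hypothetical names, timestamps are passed in as plain nanosecond values, and locking is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model of one XXX_start / XXX_delay / XXX_count triplet. */
struct stat_model {
    uint64_t start_ns;  /* timestamp taken when the wait began */
    uint64_t delay_ns;  /* accumulated wait time in nanoseconds */
    uint32_t count;     /* number of completed waits */
};

/* Record the moment the wait begins. */
static void stat_start(struct stat_model *s, uint64_t now_ns)
{
    s->start_ns = now_ns;
}

/* Close the interval: add the elapsed time and count the event.
 * An end timestamp that is not later than the start contributes nothing. */
static void stat_end(struct stat_model *s, uint64_t now_ns)
{
    if (now_ns > s->start_ns) {
        s->delay_ns += now_ns - s->start_ns;
        s->count++;
    }
}
```

Keeping only one start timestamp per triplet is enough here because the model assumes a task waits on one event of a given kind at a time.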
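2.1 IO delay accounting
The main scheduler function __schedule() checks the in_iowait field of task_struct. If it is true, the task being switched out is waiting for IO, so delayacct_blkio_start() is called to start timing and the start timestamp is stored in blkio_start.
When the IO completes, the blocked task is woken up and delayacct_blkio_end() is called to stop timing. That function updates blkio_delay and blkio_count.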
kernel/sched/core.c
static void __sched notrace __schedule(bool preempt)
{
    ...
    if (prev->in_iowait) {
        atomic_inc(&rq->nr_iowait);
        delayacct_blkio_start();
    }
    ...
}

static int
try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
    ...
    cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
    if (task_cpu(p) != cpu) {
        if (p->in_iowait) {
            delayacct_blkio_end(p);
            atomic_dec(&task_rq(p)->nr_iowait);
        }
        ...
    }
    ...
}
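There is one special case: the IO wait may come from a page fault that needs a swapin. In that case DELAYACCT_PF_SWAPIN is OR-ed into the flags field of the structure so that the swapin can be timed separately. Once the IO completes, the elapsed time is added to swapin_delay rather than blkio_delay. In other words, blkio_delay covers non-swapin IO waits and swapin_delay covers swapin IO waits.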
mm/memory.c
vm_fault_t do_swap_page(struct vm_fault *vmf)
{
    ...
    delayacct_set_flag(DELAYACCT_PF_SWAPIN);
    page = lookup_swap_cache(entry, vma, vmf->address);
    swapcache = page;
    ...
    locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
    delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
    ...
}
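The routing between blkio_delay and swapin_delay can be modelled in a few lines. This is a user-space sketch under stated assumptions, not the kernel implementation. struct io_model, MODEL_PF_SWAPIN, io_wait_start() and io_wait_end() are hypothetical names, timestamps are plain nanosecond values supplied by the caller, and locking is left out.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PF_SWAPIN 0x1u  /* hypothetical stand-in for DELAYACCT_PF_SWAPIN */

/* Hypothetical model of the IO-related fields of task_delay_info. */
struct io_model {
    unsigned int flags;
    uint64_t blkio_start;   /* start timestamp shared by both kinds of wait */
    uint64_t blkio_delay;   /* accumulated non-swapin IO wait, ns */
    uint64_t swapin_delay;  /* accumulated swapin IO wait, ns */
    uint32_t blkio_count;
    uint32_t swapin_count;
};

/* The task goes to sleep waiting for IO. */
static void io_wait_start(struct io_model *m, uint64_t now_ns)
{
    m->blkio_start = now_ns;
}

/* The task is woken up: charge the wait to swapin or to plain block IO,
 * depending on whether the swapin flag is set at that moment. */
static void io_wait_end(struct io_model *m, uint64_t now_ns)
{
    uint64_t *total = &m->blkio_delay;
    uint32_t *count = &m->blkio_count;

    if (m->flags & MODEL_PF_SWAPIN) {
        total = &m->swapin_delay;
        count = &m->swapin_count;
    }
    if (now_ns > m->blkio_start) {
        *total += now_ns - m->blkio_start;
        (*count)++;
    }
}
```

Because the flag is examined only when the wait ends, it has to stay set across the whole sleep, which is why the excerpt sets it before the swap cache lookup and clears it only after the page lock has been taken.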
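2.2 Reclaim delay accounting
When memory is tight the kernel reclaims memory, which eventually reaches do_try_to_free_pages(). The time spent in the whole reclaim pass (the calls to shrink_zones()) is measured here.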
mm/vmscan.c
static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
                                          struct scan_control *sc)
{
    ...
retry:
    delayacct_freepages_start();
    ...
    delayacct_freepages_end();
    ...
}
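2.3 Thrashing delay accounting
A task is considered to be thrashing when it has to wait for a cache page that belonged to the working set but was evicted and is now being read back in: the page is locked, not yet up to date, and marked PageWorkingset. wait_on_page_bit_common() measures the duration of each such wait and counts them. As the excerpt shows, only pages that are not swap-backed are charged to delayacct.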
mm/filemap.c
static inline __sched int wait_on_page_bit_common(wait_queue_head_t *q,
    struct page *page, int bit_nr, int state, enum behavior behavior)
{
    ...
    if (bit_nr == PG_locked &&
        !PageUptodate(page) && PageWorkingset(page)) {
        if (!PageSwapBacked(page)) {
            delayacct_thrashing_start();
            delayacct = true;
        }
        psi_memstall_enter(&pflags);
        thrashing = true;
    }
    ...
    finish_wait(q, wait);
    if (thrashing) {
        if (delayacct)
            delayacct_thrashing_end();
        psi_memstall_leave(&pflags);
    }
    ...
}
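The two nested conditions in the excerpt can be restated as a pair of predicates. This is a hedged user-space sketch. struct page_model, is_thrashing_wait() and charges_delayacct() are hypothetical names that only mirror the checks shown above and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of the page state the check looks at. */
struct page_model {
    bool uptodate;     /* page contents are valid */
    bool workingset;   /* page is marked as part of the working set */
    bool swap_backed;  /* page is backed by swap rather than by a file */
};

/* True when the wait counts as thrashing: the waiter wants the page lock,
 * the page is still being read in, and it is a working-set page. */
static bool is_thrashing_wait(bool waiting_for_lock, const struct page_model *p)
{
    return waiting_for_lock && !p->uptodate && p->workingset;
}

/* True when the thrashing wait is also charged to delay accounting:
 * swap-backed pages are excluded. */
static bool charges_delayacct(bool waiting_for_lock, const struct page_model *p)
{
    return is_thrashing_wait(waiting_for_lock, p) && !p->swap_backed;
}
```

A swap-backed working-set page therefore still triggers the PSI memstall path in the excerpt, but it does not add to thrashing_delay or thrashing_count.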
3. Initialization
start_kernel() calls delayacct_init() to set delay accounting up:
kernel/delayacct.c
void delayacct_init(void)
{
    delayacct_cache = KMEM_CACHE(task_delay_info, SLAB_PANIC|SLAB_ACCOUNT);
    delayacct_tsk_init(&init_task);
}
include/linux/delayacct.h
static inline void delayacct_tsk_init(struct task_struct *tsk)
{
    /* reinitialize in case parent's non-null pointer was dup'ed*/
    tsk->delays = NULL;
    if (delayacct_on)
        __delayacct_tsk_init(tsk);
}
kernel/delayacct.c
void __delayacct_tsk_init(struct task_struct *tsk)
{
    tsk->delays = kmem_cache_zalloc(delayacct_cache, GFP_KERNEL);
    if (tsk->delays)
        raw_spin_lock_init(&tsk->delays->lock);
}
In short, each task's task_delay_info object is allocated from delayacct_cache with kmem_cache_zalloc().
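The same "reset the pointer, then allocate only if enabled" flow can be sketched in user space. This is an illustration under assumptions, not the kernel code: calloc() stands in for kmem_cache_zalloc(), and struct delays_model, struct task_model, task_delays_init() and task_delays_exit() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, heavily reduced stand-ins for task_delay_info and task_struct. */
struct delays_model {
    unsigned int flags;
    unsigned long long blkio_delay;
};

struct task_model {
    struct delays_model *delays;
};

/* Always reset the pointer first, then allocate a zeroed object
 * only when accounting is switched on. */
static void task_delays_init(struct task_model *t, bool accounting_on)
{
    t->delays = NULL;
    if (accounting_on)
        t->delays = calloc(1, sizeof(*t->delays));  /* may be NULL on failure */
}

/* Release the per-task object when the task goes away. */
static void task_delays_exit(struct task_model *t)
{
    free(t->delays);
    t->delays = NULL;
}
```

Since the pointer stays NULL when accounting is off or when the allocation fails, any code that updates the counters has to check it before dereferencing.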
References:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/translations/zh_CN/accounting/delay-accounting.rst
https://justinwei.blog.csdn.net/article/details/128287053