Beyond Page Tables: Memory Management in the Linux Kernel

pwn kernel

 2024/10/30 

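We all know that a process addresses memory with virtual addresses, that the MMU translates virtual addresses into physical ones, that every process has its own page tables, and so on. But with so many page tables around, how does the kernel actually manage them? And what really happens when we ask the kernel for the next 4 KB?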

Based on: https://arttnba3.cn/2021/02/21/OS-0X00-LINUX-KERNEL-PART-I
A more introductory article: https://cloud.tencent.com/developer/article/1775509
https://segmentfault.com/a/1190000043626203

"Main memory" from the kernel's point of view

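From the kernel's point of view, memory is managed at three levels, from the top down: node, zone, and page (or page frame). We go through them from the highest (largest) level downward.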

What is a node?

https://blog.csdn.net/gatieme/article/details/52384075
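That post is detailed and well written, so I won't copy it here; follow the link to learn about nodes. As I understand it from these two blogs, a node is roughly a physical memory stick. That is certainly not the literal meaning, but in terms of abstraction level a node sits about where a physical memory module would. The concept comes from hardware processor design. Broadly, modern multi-core CPUs access memory under one of two architectures: UMA (Uniform Memory Access) and NUMA (Non-Uniform Memory Access).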

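UMA first. The idea is that all CPUs have the same level of access to all of main memory: access times over the bus are essentially identical, and peripherals are shared as well. Everyone is equal.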

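The node concept comes from the NUMA architecture. Put simply, although every CPU can access all of physical memory, each one reaches its "own" local memory faster. Such local memory is called a bank, and the CPUs that own the banks are called nodes. When we talk about nodes in the context of OS memory allocation, we mean these memory banks.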

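Linux wants an architecture-independent memory allocation structure, so it adopts NUMA's node concept. UMA is treated as a pseudo-NUMA system with exactly one node. On a UMA machine, then, the single node is all of physical memory.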

Linux describes a node with pg_data_t.

/*
 * The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
 * (mostly NUMA machines?) to denote a higher-level memory zone than the
 * zone denotes.
 *
 * On NUMA machines, each NUMA node would have a pg_data_t to describe
 * it's memory layout.
 *
 * Memory statistics and page replacement data structures are maintained on a
 * per-zone basis.
 */
struct bootmem_data;
typedef struct pglist_data {
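    /* Data structures of the zones in this node; the possible zone types are given by zone_type */
    struct zone node_zones[MAX_NR_ZONES];
    /* Lists of fallback nodes and their zones, used to allocate from another node when this node has no free space */
    struct zonelist node_zonelists[MAX_ZONELISTS];
    int nr_zones;                                   /* Number of different zones in this node */
#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
    struct page *node_mem_map;      /* Pointer to the array of struct page instances describing every physical page of the node, across all of its zones */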
#ifdef CONFIG_PAGE_EXTENSION
    struct page_ext *node_page_ext;
#endif
#endif
#ifndef CONFIG_NO_BOOTMEM
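       /* During boot, before the memory management subsystem is initialized,
       the kernel also needs memory (and some memory must be reserved for
       initializing the memory management subsystem itself).
       To solve this the kernel uses the boot memory allocator (bootmem);
       this structure is used for memory management during that phase. */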
    struct bootmem_data *bdata;
#endif
#ifdef CONFIG_MEMORY_HOTPLUG
    /*
     * Must be held any time you expect node_start_pfn, node_present_pages
     * or node_spanned_pages stay constant.  Holding this will also
     * guarantee that any pfn_valid() stays that way.
     *
     * pgdat_resize_lock() and pgdat_resize_unlock() are provided to
     * manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG.
     *
     * Nests above zone->lock and zone->span_seqlock
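     * When memory hotplug is supported, this lock protects the fields of
     * this structure that relate to the node size. Code that relies on
     * node_start_pfn, node_present_pages or node_spanned_pages must hold it.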
     */
    spinlock_t node_size_lock;
#endif
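    /* Starting page frame number: the offset of this node in the global mem_map.
    All page frames in the system are numbered consecutively, and each frame
    number is globally unique (not just unique within the node). */
    unsigned long node_start_pfn;
    unsigned long node_present_pages; /* total number of physical pages in the node */
    unsigned long node_spanned_pages; /* total size of physical page range, in page frames, including holes */
    int node_id;        /* Global node ID; NUMA nodes are numbered from 0 */
    wait_queue_head_t kswapd_wait;      /* Wait queue of the swap daemon (kswapd),
    used when page frames are swapped out of the node; discussed in detail in later articles. */
    wait_queue_head_t pfmemalloc_wait;
    struct task_struct *kswapd;     /* Protected by mem_hotplug_begin/end(); task_struct of the swap daemon responsible for this node */
    int kswapd_max_order;                       /* Order of the region that needs to be freed */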
    enum zone_type classzone_idx;

#ifdef CONFIG_COMPACTION
    int kcompactd_max_order;
    enum zone_type kcompactd_classzone_idx;
    wait_queue_head_t kcompactd_wait;
    struct task_struct *kcompactd;
#endif

#ifdef CONFIG_NUMA_BALANCING
    /* Lock serializing the migrate rate limiting window */
    spinlock_t numabalancing_migrate_lock;

    /* Rate limiting time interval */
    unsigned long numabalancing_migrate_next_window;

    /* Number of pages migrated during the rate limiting time interval */
    unsigned long numabalancing_migrate_nr_pages;
#endif

#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
    /*
     * If memory initialisation on large machines is deferred then this
     * is the first PFN that needs to be initialised.
     */
    unsigned long first_deferred_pfn;
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
    spinlock_t split_queue_lock;
    struct list_head split_queue;
    unsigned long split_queue_len;
#endif
} pg_data_t;
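To make the three size fields concrete, here is a minimal userspace sketch of how node_start_pfn, node_present_pages and node_spanned_pages relate. The struct toy_node type, the toy_* helpers and the 4 KB page size are assumptions made for illustration; they are not kernel definitions.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12  /* assumption: 4 KB pages */

/* Hypothetical cut-down node holding only the three size fields. */
struct toy_node {
    unsigned long node_start_pfn;      /* first page frame number of the node */
    unsigned long node_present_pages;  /* pages that really exist */
    unsigned long node_spanned_pages;  /* size of the pfn range, holes included */
};

/* One past the last page frame number covered by the node. */
unsigned long toy_node_end_pfn(const struct toy_node *n)
{
    return n->node_start_pfn + n->node_spanned_pages;
}

/* Number of page frames inside the range that are holes. */
unsigned long toy_node_hole_pages(const struct toy_node *n)
{
    assert(n->node_present_pages <= n->node_spanned_pages);
    return n->node_spanned_pages - n->node_present_pages;
}

/* Physical address of the first byte of a page frame. */
unsigned long long toy_pfn_to_phys(unsigned long pfn)
{
    return (unsigned long long)pfn << TOY_PAGE_SHIFT;
}
```

For a node starting at pfn 0x100 that spans 0x400 frames of which 0x300 are present, the range ends at pfn 0x500 and contains 0x100 frames of holes.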

What is a zone?

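Based on https://www.cnblogs.com/linhaostudy/p/10006723.html#_label0
As noted above, each CPU has its own node memory, and each node is further divided into zones. A zone is described by struct zone (struct zone_struct, typedef'd as zone_t, in old kernels) and represents a range of memory. The lowest 16 MB is ZONE_DMA, which some Industry Standard Architecture (ISA) devices need. Next comes ZONE_NORMAL, ordinary memory that is mapped directly into the kernel. Last is ZONE_HIGHMEM, the physical memory beyond the kernel's direct mapping, known as high memory (no longer used on 64-bit). It is usable memory, but the kernel cannot map it directly.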

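Why go to all this trouble? Each of the three zones has its reason. The lowest 16 MB exists for compatibility with ISA bus DMA controllers (whatever exactly those are). The middle part, which can be mapped linearly, simply is. And since 32-bit x86 can address only 4 GB, a lot of memory cannot be mapped directly and has to be handled separately. Hence three zones.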
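The split can be sketched as a simple classification of physical addresses. The 16 MB boundary comes from the text; the 896 MB boundary is an assumption (the value commonly used for the direct mapping on 32-bit x86 with a 3G/1G split), and toy_zone_of and the TOY_* names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

enum toy_zone { TOY_ZONE_DMA, TOY_ZONE_NORMAL, TOY_ZONE_HIGHMEM };

#define TOY_DMA_LIMIT    (16ULL << 20)   /* 16 MB, as stated in the text */
#define TOY_LOWMEM_LIMIT (896ULL << 20)  /* assumption: classic 32-bit x86 layout */

/* Classify a physical address into one of the three classic zones. */
enum toy_zone toy_zone_of(uint64_t phys)
{
    if (phys < TOY_DMA_LIMIT)
        return TOY_ZONE_DMA;
    if (phys < TOY_LOWMEM_LIMIT)
        return TOY_ZONE_NORMAL;
    return TOY_ZONE_HIGHMEM;
}

/* Small usage check: one address from each zone. */
void toy_zone_demo(void)
{
    assert(toy_zone_of(0x1000) == TOY_ZONE_DMA);
    assert(toy_zone_of(64ULL << 20) == TOY_ZONE_NORMAL);
    assert(toy_zone_of(1ULL << 30) == TOY_ZONE_HIGHMEM);
}
```

On a 64-bit kernel the last branch would never be taken, since there is no high memory zone there.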

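There are other zone types as well, pseudo-zones such as ZONE_MOVABLE and ZONE_DEVICE, designed for features like memory hotplug; they are not covered here. Today's AMD64 architecture no longer needs high memory: 128 TB of kernel address space is enough to map all physical memory linearly into the kernel.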

This is also the starting point of the ret2dir attack technique discussed later.

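Finally, a brief word on how high memory beyond the normal zone is accessed on 32-bit, with 1 GB of kernel space and 3 GB of user space. According to that blog, the kernel temporarily rewrites page table entries: it sets aside a range of kernel virtual addresses, borrows it to map the high-memory pages it needs, and gives the range back when done. Intel also supports a page table extension called PAE (Physical Address Extension), which extends the page tables by a level so that the system can address more physical memory; this is not covered here either.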

struct zone
{
    /* Read-mostly fields */

    /* zone watermarks, access with *_wmark_pages(zone) macros */
    unsigned long watermark[NR_WMARK];

    unsigned long nr_reserved_highatomic;

    /*
     * We don't know if the memory that we're going to allocate will be
     * freeable or/and it will be released eventually, so to avoid totally
     * wasting several GB of ram we must reserve some of the lower zone
     * memory (otherwise we risk to run OOM on the lower zones despite
     * there being tons of freeable ram on the higher zones).  This array is
     * recalculated at runtime if the sysctl_lowmem_reserve_ratio sysctl
     * changes.
     * Reserves a number of pages for each zone, used for critical
     * allocations that must not fail under any circumstances.
     */
    long lowmem_reserve[MAX_NR_ZONES];

#ifdef CONFIG_NUMA
    int node;
#endif

    /*
     * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
     * this zone's LRU.  Maintained by the pageout code.
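     * Ratio of inactive pages.
     * Next come some rarely used or mostly read-only fields:
     * wait_table, wait_table_hash_nr_entries, wait_table_bits.
     * They form a wait queue on which processes can wait for a page to become available. */
    unsigned int inactive_ratio;

    /* Points to the pglist_data object that this zone belongs to */
    struct pglist_data      *zone_pgdat;
    /* This field implements the per-CPU hot/cold page frame lists. The kernel uses these lists to hold "fresh" pages that can satisfy allocations. Hot and cold frames differ in cache state: frames likely to still be in the CPU cache can be accessed quickly and are called hot; uncached frames are called cold. */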
    struct per_cpu_pageset __percpu *pageset;

    /*
     * This is a per-zone reserve of pages that are not available
     * to userspace allocations.
     * (number of pages reserved per zone that userspace cannot allocate)
     */
    unsigned long       totalreserve_pages;

#ifndef CONFIG_SPARSEMEM
    /*
     * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
     * In SPARSEMEM, this map is stored in struct mem_section
     */
    unsigned long       *pageblock_flags;
#endif /* CONFIG_SPARSEMEM */

#ifdef CONFIG_NUMA
    /*
     * zone reclaim becomes active if more unmapped pages exist.
     */
    unsigned long       min_unmapped_pages;
    unsigned long       min_slab_pages;
#endif /* CONFIG_NUMA */

    /* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT
     * i.e. the first page frame of the zone */
    unsigned long       zone_start_pfn;

    /*
     * spanned_pages is the total pages spanned by the zone, including
     * holes, which is calculated as:
     *      spanned_pages = zone_end_pfn - zone_start_pfn;
     *
     * present_pages is physical pages existing within the zone, which
     * is calculated as:
     *      present_pages = spanned_pages - absent_pages(pages in holes);
     *
     * managed_pages is present pages managed by the buddy system, which
     * is calculated as (reserved_pages includes pages allocated by the
     * bootmem allocator):
     *      managed_pages = present_pages - reserved_pages;
     *
     * So present_pages may be used by memory hotplug or memory power
     * management logic to figure out unmanaged pages by checking
     * (present_pages - managed_pages). And managed_pages should be used
     * by page allocator and vm scanner to calculate all kinds of watermarks
     * and thresholds.
     *
     * Locking rules:
     *
     * zone_start_pfn and spanned_pages are protected by span_seqlock.
     * It is a seqlock because it has to be read outside of zone->lock,
     * and it is done in the main allocator path.  But, it is written
     * quite infrequently.
     *
     * The span_seq lock is declared along with zone->lock because it is
     * frequently read in proximity to zone->lock.  It's good to
     * give them a chance of being in the same cacheline.
     *
     * Write access to present_pages at runtime should be protected by
     * mem_hotplug_begin/end(). Any reader who can't tolerant drift of
     * present_pages should get_online_mems() to get a stable value.
     *
     * Read access to managed_pages should be safe because it's unsigned
     * long. Write access to zone->managed_pages and totalram_pages are
     * protected by managed_page_count_lock at runtime. Idealy only
     * adjust_managed_page_count() should be used instead of directly
     * touching zone->managed_pages and totalram_pages.
     */
    unsigned long       managed_pages;
    unsigned long       spanned_pages;             /* Total pages, including holes */
    unsigned long       present_pages;              /* Pages actually present, excluding holes */

    /* Traditional name of the zone: "DMA", "NORMAL" or "HIGHMEM" */
    const char          *name;

#ifdef CONFIG_MEMORY_ISOLATION
    /*
     * Number of isolated pageblock. It is used to solve incorrect
     * freepage counting problem due to racy retrieving migratetype
     * of pageblock. Protected by zone->lock.
     */
    unsigned long       nr_isolate_pageblock;
#endif

#ifdef CONFIG_MEMORY_HOTPLUG
    /* see spanned/present_pages for more description */
    seqlock_t           span_seqlock;
#endif

    /*
     * wait_table       -- the array holding the hash table
     * wait_table_hash_nr_entries   -- the size of the hash table array
     * wait_table_bits      -- wait_table_size == (1 << wait_table_bits)
     *
     * The purpose of all these is to keep track of the people
     * waiting for a page to become available and make them
     * runnable again when possible. The trouble is that this
     * consumes a lot of space, especially when so few things
     * wait on pages at a given time. So instead of using
     * per-page waitqueues, we use a waitqueue hash table.
     *
     * The bucket discipline is to sleep on the same queue when
     * colliding and wake all in that wait queue when removing.
     * When something wakes, it must check to be sure its page is
     * truly available, a la thundering herd. The cost of a
     * collision is great, but given the expected load of the
     * table, they should be so rare as to be outweighed by the
     * benefits from the saved space.
     *
     * __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
     * primary users of these fields, and in mm/page_alloc.c
     * free_area_init_core() performs the initialization of them.
     */
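    /* Hash table of wait queues for processes waiting on a page in this zone */
    wait_queue_head_t       *wait_table;
    /* Number of entries in the wait queue hash table */
    unsigned long       wait_table_hash_nr_entries;
    /* Size of the wait queue hash table as a power of two: size == 1 << wait_table_bits */
    unsigned long       wait_table_bits;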

    ZONE_PADDING(_pad1_)

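    /* free areas of different sizes.
       Free-page bookkeeping for the buddy system: each array element
       heads the lists of free blocks of the corresponding order.
       The fields that follow are accessed by the page reclaim scanner,
       which catalogues the pages in use in the zone by activity:
       a frequently accessed page frame is active, otherwise it is inactive.
       This information matters when page frames have to be swapped out. */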
    struct free_area    free_area[MAX_ORDER];

    /* zone flags, see below: the current state of the zone, see enum zone_flags */
    unsigned long       flags;

    /* Write-intensive fields used from the page allocator; spinlock protecting this descriptor */
    spinlock_t          lock;

    ZONE_PADDING(_pad2_)

    /* Write-intensive fields used by page reclaim */

    /* Fields commonly accessed by the page reclaim scanner */
    spinlock_t          lru_lock;   /* Spinlock for the LRU (least recently used) active and inactive lists */
    struct lruvec       lruvec;

    /*
     * When free pages are below this point, additional steps are taken
     * when reading the number of free pages to avoid per-cpu counter
     * drift allowing watermarks to be breached
     * (percpu_drift_mark is that threshold: below it, reading the free
     * page count does extra work so that per-CPU counter drift cannot
     * push the zone past its watermarks)
     */
    unsigned long percpu_drift_mark;

#if defined CONFIG_COMPACTION || defined CONFIG_CMA
    /* pfn where compaction free scanner should start */
    unsigned long       compact_cached_free_pfn;
    /* pfn where async and sync compaction migration scanner should start */
    unsigned long       compact_cached_migrate_pfn[2];
#endif

#ifdef CONFIG_COMPACTION
    /*
     * On compaction failure, 1<<compact_defer_shift compactions
     * are skipped before trying again. The number attempted since
     * last failure is tracked with compact_considered.
     */
    unsigned int        compact_considered;
    unsigned int        compact_defer_shift;
    int                       compact_order_failed;
#endif

#if defined CONFIG_COMPACTION || defined CONFIG_CMA
    /* Set to true when the PG_migrate_skip bits should be cleared */
    bool            compact_blockskip_flush;
#endif

    bool            contiguous;

    ZONE_PADDING(_pad3_)
    /* Zone statistics, see enum zone_stat_item */
    atomic_long_t       vm_stat[NR_VM_ZONE_STAT_ITEMS];
} ____cacheline_internodealigned_in_smp;

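The rest is too involved to cover here; see the link below for a summary. Overall, a zone is only a logical grouping that lets the operating system manage page allocation in a structured, ordered way. It is not a physically enforced boundary.

See https://www.cnblogs.com/linhaostudy/p/10006723.html#autoid-6-4-0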

What is a page frame?

https://blog.csdn.net/gatieme/article/details/52384636
This appears to be part of a series of articles by the same expert.

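At last we reach familiar ground: below the zone, what gets managed is individual page frames. Linux describes each physical page frame with a structure called struct page. Because there are so many pages, struct page must be as small as possible, so it makes heavy use of unions.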

/*
 * Each physical page in the system has a struct page associated with
 * it to keep track of whatever it is we are using the page for at the
 * moment. Note that we have no way to track which tasks are using
 * a page, though if it is a pagecache page, rmap structures can tell us
 * who is mapping it.
 *
 * The objects in struct page are organized in double word blocks in
 * order to allows us to use atomic double word operations on portions
 * of struct page. That is currently only used by slub but the arrangement
 * allows the use of atomic double word operations on the flags/mapping
 * and lru list pointers also.
 */
struct page {
    /* First double word block */
    unsigned long flags;        /* Atomic flags, some possibly updated asynchronously.
                                              Describes the state of the page and other information */
    union
    {
        struct address_space *mapping;  /* If low bit clear, points to
                         * inode address_space, or NULL.
                         * If page mapped as anonymous
                         * memory, low bit is set, and
                         * it points to anon_vma object:
                         * see PAGE_MAPPING_ANON below.
                         */
        void *s_mem;            /* slab first object; now moved into struct slab */
        atomic_t compound_mapcount;     /* first tail page */
        /* page_deferred_list().next     -- second tail page */
    };

    /* Second double word */
    struct {
        union {
            pgoff_t index;      /* Our offset within mapping.
            For a page-cache page this is the offset, in pages, within the
            mapping (the file's address_space), so it locates the page
            inside the mapped object rather than inside a process. */
            void *freelist;     /* sl[aou]b first free object */
            /* page_deferred_list().prev    -- second tail page */
        };

        union {
#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
    defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
            /* Used for cmpxchg_double in slub */
            unsigned long counters;
#else
            /*
             * Keep _refcount separate from slub cmpxchg_double
             * data.  As the rest of the double word is protected by
             * slab_lock but _refcount is not.
             */
            unsigned counters;
#endif

            struct {

                union {
                    /*
                     * Count of ptes mapped in mms, to show
                     * when page is mapped & limit reverse
                     * map searches.
                     * Page mapping counter
                     */
                    atomic_t _mapcount;

                    struct { /* SLUB */
                        unsigned inuse:16;
                        unsigned objects:15;
                        unsigned frozen:1;
                    };
                    int units;      /* SLOB */
                };
                /*
                 * Usage count, *USE WRAPPER FUNCTION*
                 * when manual accounting. See page_ref.h
                 * Page reference counter
                 */
                atomic_t _refcount;
            };
            unsigned int active;    /* SLAB */
        };
    };

    /*
     * Third double word block
     *
     * WARNING: bit 0 of the first word encode PageTail(). That means
     * the rest users of the storage space MUST NOT use the bit to
     * avoid collision and false-positive PageTail().
     */
    union {
        struct list_head lru;   /* Pageout list, eg. active_list
                     * protected by zone->lru_lock !
                     * Can be used as a generic list
                     * by the page owner.
                     */
        struct dev_pagemap *pgmap; /* ZONE_DEVICE pages are never on an
                        * lru or handled by a slab
                        * allocator, this points to the
                        * hosting device page map.
                        */
        struct {        /* slub per cpu partial pages */
            struct page *next;      /* Next partial slab */
#ifdef CONFIG_64BIT
            int pages;      /* Nr of partial slabs left */
            int pobjects;   /* Approximate # of objects */
#else
            short int pages;
            short int pobjects;
#endif
        };

        struct rcu_head rcu_head;       /* Used by SLAB
                         * when destroying via RCU
                         */
        /* Tail pages of compound page */
        struct {
            unsigned long compound_head; /* If bit zero is set */

            /* First tail page only */
#ifdef CONFIG_64BIT
            /*
             * On 64 bit system we have enough space in struct page
             * to encode compound_dtor and compound_order with
             * unsigned int. It can help compiler generate better or
             * smaller code on some archtectures.
             */
            unsigned int compound_dtor;
            unsigned int compound_order;
#else
            unsigned short int compound_dtor;
            unsigned short int compound_order;
#endif
        };

#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && USE_SPLIT_PMD_PTLOCKS
        struct {
            unsigned long __pad;    /* do not overlay pmd_huge_pte
                         * with compound_head to avoid
                         * possible bit 0 collision.
                         */
            pgtable_t pmd_huge_pte; /* protected by page->ptl */
        };
#endif
    };

    /* Remainder is not double word aligned */
    union {
        unsigned long private;      /* Mapping-private opaque data:
                         * usually used for buffer_heads
                         * if PagePrivate set; used for
                         * swp_entry_t if PageSwapCache;
                         * indicates order in the buddy
                         * system if PG_buddy is set.
                         * Private data pointer; its meaning depends on the use case
                         */
#if USE_SPLIT_PTE_PTLOCKS
#if ALLOC_SPLIT_PTLOCKS
        spinlock_t *ptl;
#else
        spinlock_t ptl;
#endif
#endif
        struct kmem_cache *slab_cache;  /* SL[AU]B: Pointer to slab */
    };

#ifdef CONFIG_MEMCG
    struct mem_cgroup *mem_cgroup;
#endif

    /*
     * On machines where all RAM is mapped into kernel address space,
     * we can simply calculate the virtual address. On machines with
     * highmem some memory is mapped into kernel virtual memory
     * dynamically, so we need a place to store that address.
     * Note that this field could be 16 bits on x86 ... ;)
     *
     * Architectures with slow multiplication can define
     * WANT_PAGE_VIRTUAL in asm/page.h
     */
#if defined(WANT_PAGE_VIRTUAL)
    void *virtual;          /* Kernel virtual address (NULL if
                       not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */

#ifdef CONFIG_KMEMCHECK
    /*
     * kmemcheck wants to track the status of each byte in a page; this
     * is a pointer to such a status block. NULL if not tracked.
     */
    void *shadow;
#endif

#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
    int _last_cpupid;
#endif
}
/*
 * The struct page can be forced to be double word aligned so that atomic ops
 * on double words work. The SLUB allocator can make use of such a feature.
 */
#ifdef CONFIG_HAVE_ALIGNED_STRUCT_PAGE
    __aligned(2 * sizeof(unsigned long))
#endif
;

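In older kernels slab had no structure of its own; everything was folded into struct page. Later, for reasons of space and readability, the slab part was split out. In the version shown here, struct page still carries quite a few slab-related fields, along with the flag bits that identify what a page belongs to and what its properties are.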

Buddy System

arttnba3's blog: https://arttnba3.cn/2021/02/21/OS-0X00-LINUX-KERNEL-PART-I/#三、buddy-system

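The buddy system is the Linux kernel's low-level mechanism for managing memory at page granularity. Each zone has a free_area array that holds free pages; its size is usually 11 (MAX_ORDER). This array exists for the buddy system.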

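At bottom, the buddy system exists to cope with frequent memory allocation. In theory, once we have pages we could just hand them out, but that runs into a number of problems, and one of the most important is fragmentation. Fragmentation is either external or internal. External fragmentation is fragmentation between pages: allocate three adjacent pages, free only the first, and that lone page can never serve a later request for more than one contiguous page. Internal fragmentation is waste inside a page: I need only a few bytes but receive a whole 4 KB page, so most of it sits idle. The buddy system was designed to address external fragmentation; slab/slub/slob, described below, address internal fragmentation.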

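The buddy system's scheme is clever: it groups physical pages into physically contiguous blocks whose sizes are powers of two and tracks them in an array. free_area[0] is a list of single pages, free_area[1] is a list whose elements are two physically contiguous pages each, and so on.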
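The relationship between an order and its block size is easy to put in code. The sketch below assumes 4 KB pages and 11 orders (0 to 10); the toy_* functions are illustrative helpers, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL  /* assumption: 4 KB pages */
#define TOY_MAX_ORDER 11      /* orders 0..10, matching the usual free_area size */

/* Number of contiguous pages in one block of the given order. */
unsigned long toy_block_pages(unsigned int order)
{
    assert(order < TOY_MAX_ORDER);
    return 1UL << order;
}

/* Size in bytes of one block of the given order. */
unsigned long toy_block_bytes(unsigned int order)
{
    return toy_block_pages(order) * TOY_PAGE_SIZE;
}

/* Smallest order whose block can hold `size` bytes, or -1 if none can. */
int toy_order_for(size_t size)
{
    unsigned int order = 0;

    while (order < TOY_MAX_ORDER && toy_block_bytes(order) < size)
        order++;
    return order < TOY_MAX_ORDER ? (int)order : -1;
}
```

A request for three pages (12288 bytes) is therefore served from order 2, a four-page block, which is where the rounding to a power of two shows up.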

The buddy system also tags pages with different attributes (migrate types):

enum migratetype {
	MIGRATE_UNMOVABLE,
	MIGRATE_MOVABLE,
	MIGRATE_RECLAIMABLE,
	MIGRATE_PCPTYPES,	/* the number of types on the pcp lists */
	MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
#ifdef CONFIG_CMA
	/*
	 * MIGRATE_CMA migration type is designed to mimic the way
	 * ZONE_MOVABLE works.  Only movable pages can be allocated
	 * from MIGRATE_CMA pageblocks and page allocator never
	 * implicitly change migration type of MIGRATE_CMA pageblock.
	 *
	 * The way to use it is to change migratetype of a range of
	 * pageblocks to MIGRATE_CMA which can be done by
	 * __free_pageblock_cma() function.
	 */
	MIGRATE_CMA,
#endif
#ifdef CONFIG_MEMORY_ISOLATION
	MIGRATE_ISOLATE,	/* can't allocate from here */
#endif
	MIGRATE_TYPES
};

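These attributes are:

Unmovable pages: they have a fixed position in memory and cannot be moved elsewhere. Most pages used by the kernel itself are of this type.

Reclaimable pages: they cannot be moved directly, but they can be dropped, because their contents can be regenerated from some source. Pages that map file data are an example. When memory runs short (an allocation fails), reclaim kicks in and pages of this type are written back and freed.

Movable pages: they can be moved freely. Pages used by userspace processes that are not backed by a disk file are of this type (heap, stack, shmem shared memory, anonymous shared mmap). They are mapped through the process page tables, so copying them to a new location only requires updating those page tables. Such pages are usually taken from the high memory zone.

Each free_area element of the buddy system holds an array of list heads, free_list[MIGRATE_TYPES], with one list per migrate type. All blocks on one list share the same migrate type, and all blocks in one free_area have the same number of contiguous pages.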

Allocation and freeing logic

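On a request, the buddy system first rounds the requested size up to a power-of-two number of pages, then takes a contiguous block from the free_area at the corresponding index.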

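Match the migrate type and the size, find the corresponding free_list, and take a block off it.
If that list is empty, ask the next higher free_area. Split the block obtained there in two, return one half, and put the other half on the lower-order list. If the higher list is empty as well, keep going up.
On free, the block goes back on the list for its order (much like a bin); if its buddy is also free, the two are merged and the result moves up to the next higher order.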
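The split and merge steps above can be sketched as a toy userspace allocator. This is a deliberately simplified model: it manages 1024 page frames, uses one flag array per order instead of linked lists, and ignores migrate types entirely. All toy_* names are made up for the example. The one detail taken from the real scheme is that the buddy of a block is found by flipping bit `order` of its page frame number.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_ORDER 11
#define TOY_PAGES (1UL << (TOY_MAX_ORDER - 1))  /* 1024 pages = one order-10 block */

/* toy_free_map[o][pfn] != 0 means: a free block of order o starts at pfn. */
unsigned char toy_free_map[TOY_MAX_ORDER][TOY_PAGES];

/* Start with a single free block of the highest order. */
void toy_init(void)
{
    memset(toy_free_map, 0, sizeof(toy_free_map));
    toy_free_map[TOY_MAX_ORDER - 1][0] = 1;
}

/* The buddy of a block differs from it only in bit `order` of the pfn. */
unsigned long toy_buddy_pfn(unsigned long pfn, unsigned int order)
{
    return pfn ^ (1UL << order);
}

/* Allocate a block of 2^order pages; returns its first pfn or -1. */
long toy_alloc(unsigned int order)
{
    assert(order < TOY_MAX_ORDER);
    for (unsigned int o = order; o < TOY_MAX_ORDER; o++) {
        for (unsigned long pfn = 0; pfn < TOY_PAGES; pfn += 1UL << o) {
            if (!toy_free_map[o][pfn])
                continue;
            toy_free_map[o][pfn] = 0;
            /* Split down to the requested order; the upper half of each
             * split goes back on the lower-order "list". */
            while (o > order) {
                o--;
                toy_free_map[o][pfn + (1UL << o)] = 1;
            }
            return (long)pfn;
        }
    }
    return -1;
}

/* Free a block and merge it with its buddy for as long as the buddy is free. */
void toy_free(unsigned long pfn, unsigned int order)
{
    assert(order < TOY_MAX_ORDER && pfn < TOY_PAGES);
    while (order < TOY_MAX_ORDER - 1) {
        unsigned long buddy = toy_buddy_pfn(pfn, order);

        if (!toy_free_map[order][buddy])
            break;
        toy_free_map[order][buddy] = 0;
        pfn &= ~(1UL << order);  /* merged block starts at the lower buddy */
        order++;
    }
    toy_free_map[order][pfn] = 1;
}
```

Allocating two single pages from a fresh toy_init() state returns pfn 0 and pfn 1; freeing both lets the merges cascade all the way back to one order-10 block.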

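External fragmentation is thus greatly reduced but not eliminated. The kernel also performs page migration to reduce fragmentation, which I won't go into here. The kernel's page migration interface is the foundation of many other features.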

https://blog.csdn.net/yhb1047818384/article/details/119920971
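As that post shows, page migration is not simply moving a page from location A to location B. It is really a sequence: allocate a new page, copy the old page's contents into it, tear down the old page's mappings, point those mappings at the new page, and finally free the old page.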

Slab/Slub

This is the main thing to master for exploiting kernel UAF bugs.

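Slab is a finer-grained allocator that runs on top of the buddy system and addresses internal fragmentation. For small, frequently used structures such as file and task_struct, the buddy system can only hand out at least one page per request, which clearly does not fit the need. That is where slab comes in.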

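Slab is the earliest version, old enough to have many problems. An important one is poor support for NUMA (even though that is the architecture we all use now), and it is very bloated. It was therefore reworked into the slub allocator in use today, which keeps the basic design but optimizes many details of the implementation, including dropping cache coloring and improving multiprocessor and NUMA behavior. The kernel interfaces are still named after slab, and developers kept those names, so slab and slub are used somewhat interchangeably below; the reader can tell them apart from context.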

Slob, in addition, is an allocator aimed at embedded systems.

References (highly recommended): https://segmentfault.com/a/1190000043626203#item-4
https://blog.csdn.net/qq_54218833/article/details/127218102

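Overview

First, a brief overview of slab. After obtaining contiguous pages from the buddy system, slab carves them into a pool of many small, aligned objects of a given size. The pools are managed by kmem_cache and linked on lists, and they exploit the CPU cache and program locality for fast allocation and freeing.

(The paragraph above was actually written last, after the rest of this section.)

The management hierarchy of slab is roughly kmem_cache (slab cache) -> slab (freelist / page frames) -> object. Mastering the mechanism in detail takes some effort, so the following gives a rough top-down tour in three parts: components, allocation, and freeing. (Bottom-up would actually make some implementation details easier to follow.)

Components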

kmem_cache

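In slab, the unit of memory being handed out is called an object. The slabs themselves form a pool managed by a slab cache (pools within pools?), which is represented by a kmem_cache structure.
A kmem_cache is an allocator; think of it as something like main_arena. One or more contiguous page frames taken from the buddy system at a time are called a slab, and the slab allocator splits each slab into objects to hand out to the next level down. One kmem_cache serves objects of one purpose or size. All kmem_caches are linked on a global list (slab_caches), and the general-purpose kmalloc caches are additionally indexed through an array (kmalloc_caches).

The slub kmem_cache structure (stitched together from the source and another author's annotations; it is not the exact definition and is only meant as an aid to understanding):

A quick aside: cache coloring gives different slabs different starting offsets, so that objects at the same offset within different slabs do not all land on the same CPU cache lines. It apparently did not help much, and slub dropped it.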

struct kmem_cache {
#ifndef CONFIG_SLUB_TINY
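	struct kmem_cache_cpu __percpu *cpu_slab; // key field
#endif
    // Management flags of the slab cache, used to configure slab behavior,
    // e.g. how objects in a slab are aligned, whether objects are poisoned, whether red zones are placed around object memory, whether allocation and free information is tracked, and so on
    slab_flags_t flags;
    // Real footprint of an object in memory, including alignment padding, red zone, etc.
    unsigned int size;  /* The size of an object including metadata */
    // Actual size of an object, without padding
    unsigned int object_size;/* The size of an object without metadata */
    // Before an object in the pool is allocated, nobody cares about its contents,
    // so the kernel reuses the object's own memory to store the address of the next free object.
    // offset is the distance from the start of the object to where that next-free pointer is stored
    unsigned int offset;    /* Free pointer offset */
    // Size of a slab in this cache: how many pages a slab requests and how many objects it holds.
    // The low 16 bits hold the number of objects in a slab; the high 16 bits hold the page order of a slab.
    struct kmem_cache_order_objects oo;
    // Maximum number of objects and pages a slab can contain
    struct kmem_cache_order_objects max;
    // If memory is tight when allocating a slab of size oo, the kernel falls back to size min, which only has to fit one object.
    struct kmem_cache_order_objects min;
    // Allocation flags used when requesting memory from the buddy system
    gfp_t allocflags; 
    // Reference count of the slab cache; at 0 it can be destroyed and its memory returned to the buddy system
    int refcount;   
    // Constructor for pooled objects, used to initialize the objects in the slab pool
    void (*ctor)(void *);
    // object_size after alignment to the word size
    unsigned int inuse;  
    // Alignment applied to objects
    unsigned int align; 
    // Name of the slab cache, i.e. the name column shown by the slabinfo command
    const char *name;  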

/* 5) statistics */
#ifdef CONFIG_DEBUG_SLAB
	unsigned long num_active;
    ......
    // info used for slab
	int obj_offset;
#endif /* CONFIG_DEBUG_SLAB */

#ifdef CONFIG_KASAN
	struct kasan_cache kasan_info;
#endif

#ifdef CONFIG_SLAB_FREELIST_RANDOM
	unsigned int *random_seq;
#endif
	unsigned int useroffset;	/* Usercopy region offset */
	unsigned int usersize;		/* Usercopy region size */
	struct kmem_cache_node *node[MAX_NUMNODES];
};
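The oo, max and min fields pack two numbers into one word. A minimal sketch of that packing, assuming a 16-bit object count in the low half and the page order in the high half as described in the comments above; the toy_* names and struct toy_order_objects are stand-ins, not the kernel's definitions.

```c
#include <assert.h>

#define TOY_OO_SHIFT 16
#define TOY_OO_MASK  ((1U << TOY_OO_SHIFT) - 1)

/* Stand-in for struct kmem_cache_order_objects: one word, two fields. */
struct toy_order_objects {
    unsigned int x;
};

/* Pack a page order and an object count into one word. */
struct toy_order_objects toy_oo_make(unsigned int order, unsigned int objects)
{
    struct toy_order_objects oo;

    assert(objects <= TOY_OO_MASK);
    oo.x = (order << TOY_OO_SHIFT) + objects;
    return oo;
}

/* Page order of one slab: the slab spans 2^order pages. */
unsigned int toy_oo_order(struct toy_order_objects oo)
{
    return oo.x >> TOY_OO_SHIFT;
}

/* Number of objects that fit in one slab. */
unsigned int toy_oo_objects(struct toy_order_objects oo)
{
    return oo.x & TOY_OO_MASK;
}
```

Keeping both values in a single word means they can be read or replaced together in one access, so a reader never sees an order that belongs to one configuration and an object count that belongs to another.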

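Three fields deserve attention. The first is flags, which carries all sorts of configuration bits: whether objects are aligned to cache lines (typically 64 bytes), whether slab poisoning is enabled, whether red zones guard against out-of-bounds access, which zone the memory comes from (NORMAL by default), and so on.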

/* DEBUG: Red zone objs in a cache */
#define SLAB_RED_ZONE  ((slab_flags_t __force)0x00000400U)
/* DEBUG: Poison objects */
#define SLAB_POISON  ((slab_flags_t __force)0x00000800U)
/* Align objs on cache lines */
#define SLAB_HWCACHE_ALIGN ((slab_flags_t __force)0x00002000U)
/* Use GFP_DMA memory */
#define SLAB_CACHE_DMA  ((slab_flags_t __force)0x00004000U)
/* Use GFP_DMA32 memory */
#define SLAB_CACHE_DMA32 ((slab_flags_t __force)0x00008000U)
/* DEBUG: Store the last owner for bug hunting */
#define SLAB_STORE_USER

The second is the __percpu variable at the top.
Its definition in the 6.11.5 kernel source:

struct kmem_cache_cpu {
	union {
		struct {
			void **freelist;	   // first free object in the slab cached locally by this CPU
            // tid ensures that the per-CPU kmem_cache_cpu a task obtained from the slab cache still belongs to the CPU the task is running on.
			unsigned long tid;	/* Globally unique transaction id */
		};
		freelist_aba_t freelist_tid;
	};
    // The slab that this CPU caches locally for the slab cache
	struct slab *slab;	/* The slab from which we are allocating */
#ifdef CONFIG_SLUB_CPU_PARTIAL
    // Spare slab list of the CPU cache.
    // When the slab cached by the local CPU has no free objects, the kernel looks for free objects in the slabs on this partial list
	struct slab *partial;	/* Partially allocated slabs */
#endif
	local_lock_t lock;	/* Protects the fields above */
#ifdef CONFIG_SLUB_STATS
    // Statistics about object allocation from the slab
	unsigned int stat[NR_SLUB_STAT_ITEMS];
#endif
};

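This per-CPU cache is the main source of our allocations: chiefly the one slab cached in it and the partial list. How they are used is covered under allocation below. The freelist will later become one of our pwn targets.

The last field is node.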

struct kmem_cache_node {
	spinlock_t list_lock;

#ifdef CONFIG_SLAB
	struct list_head slabs_partial;	/* partial list first, better asm code */
	struct list_head slabs_full;
	struct list_head slabs_free;
	unsigned long total_slabs;	/* length of all slab lists */
	unsigned long free_slabs;	/* length of free slab list only */
	unsigned long free_objects;
	unsigned int free_limit;
	unsigned int colour_next;	/* Per-node cache coloring */
	struct array_cache *shared;	/* shared per node */
	struct alien_cache **alien;	/* on other nodes */
	unsigned long next_reap;	/* updated without locking */
	int free_touched;		/* updated without locking */
#endif

#ifdef CONFIG_SLUB
	unsigned long nr_partial;
	struct list_head partial;
#ifdef CONFIG_SLUB_DEBUG
	atomic_long_t nr_slabs;
	atomic_long_t total_objects;
	struct list_head full;
#endif
#endif
};

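As you can see, with slab each kmem_cache_node manages three doubly linked lists: partial, full and free. As the names suggest, partial holds slabs with some free objects, full holds slabs that are completely allocated, and free holds slabs that are completely free. Slabs move between the three lists dynamically as the allocation state changes, and when free grows too large its slabs are returned to the buddy system.

With slub, the free list is gone and the full list exists only under CONFIG_SLUB_DEBUG, so in normal operation only partial remains.

All current slab caches can be listed directly with cat /proc/slabinfo.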

slab

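Pages taken from the buddy system are carved into objects and placed on a freelist, forming an object pool. (The original post illustrates this with a diagram.)

A slab is the basic unit for managing allocation and freeing. A number of slabs in turn form a pool managed by a slab cache, and they request their memory from the buddy system.

In somewhat older kernels such as 5.4, slab had no separate management structure; everything was stored in struct page. As the kernel evolved, struct page needed slimming down, so the slab structure was split out to reduce its size. The pointers and other fields a page needs still have to be there, though.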

linux-5.18.19

struct slab {
	unsigned long __page_flags;

#if defined(CONFIG_SLAB)

	union {
		struct list_head slab_list;
		struct rcu_head rcu_head;
	};
	struct kmem_cache *slab_cache;
	void *freelist;	/* array of free object indexes */
	void *s_mem;	/* first object */
	unsigned int active;

#elif defined(CONFIG_SLUB)

	union {
		struct list_head slab_list;
		struct rcu_head rcu_head;
#ifdef CONFIG_SLUB_CPU_PARTIAL
		struct {
			struct slab *next;
			int slabs;	/* Nr of slabs left */
		};
#endif
	};
	struct kmem_cache *slab_cache;
	/* Double-word boundary */
	void *freelist;		/* first free object */
	union {
		unsigned long counters;
		struct {
			unsigned inuse:16;
			unsigned objects:15;
			unsigned frozen:1;
		};
	};
	unsigned int __unused;

#elif defined(CONFIG_SLOB)

	struct list_head slab_list;
	void *__unused_1;
	void *freelist;		/* first free block */
	long units;
	unsigned int __unused_2;

#else
#error "Unexpected slab allocator configured"
#endif

	atomic_t __page_refcount;
#ifdef CONFIG_MEMCG
	unsigned long memcg_data;
#endif
};

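freelist points directly at the current free object, and the free objects are linked to one another as a list. A slab does not keep its free objects in separate storage; all objects sit contiguously in the slab. A completely free slab goes straight to the free list of the kmem_cache; only a partial slab chains its free objects into an internal list of the form freelist->object.next.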
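The idea of storing the next-free pointer inside the free objects themselves can be sketched in userspace. The struct toy_slab type and the toy_* functions below are hypothetical, and the sketch leaves out everything the real allocator adds on top, such as freelist pointer hardening (CONFIG_SLAB_FREELIST_HARDENED) and freelist randomization.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature slab: a buffer cut into equal-sized objects. */
struct toy_slab {
    unsigned char *base;   /* start of the backing memory */
    size_t size;           /* stride of one object */
    size_t offset;         /* where the next-free pointer lives in a free object */
    unsigned int objects;  /* total objects in the slab */
    unsigned int inuse;    /* objects currently handed out */
    void *freelist;        /* first free object, or NULL */
};

/* Read the next-free pointer stored inside a free object. */
static void *toy_get_next(const struct toy_slab *s, void *obj)
{
    void *next;

    memcpy(&next, (unsigned char *)obj + s->offset, sizeof(next));
    return next;
}

/* Store the next-free pointer inside a free object. */
static void toy_set_next(const struct toy_slab *s, void *obj, void *next)
{
    memcpy((unsigned char *)obj + s->offset, &next, sizeof(next));
}

/* Carve buf into objects and chain them all onto the freelist. */
void toy_slab_init(struct toy_slab *s, void *buf, size_t buf_len,
                   size_t size, size_t offset)
{
    assert(offset + sizeof(void *) <= size && size <= buf_len);
    s->base = buf;
    s->size = size;
    s->offset = offset;
    s->objects = (unsigned int)(buf_len / size);
    s->inuse = 0;
    s->freelist = NULL;
    for (unsigned int i = s->objects; i-- > 0; ) {
        void *obj = s->base + (size_t)i * size;

        toy_set_next(s, obj, s->freelist);
        s->freelist = obj;
    }
}

/* Pop the first free object; NULL means the slab is full. */
void *toy_slab_alloc(struct toy_slab *s)
{
    void *obj = s->freelist;

    if (!obj)
        return NULL;
    s->freelist = toy_get_next(s, obj);
    s->inuse++;
    return obj;
}

/* Push an object back; it becomes the new head of the freelist. */
void toy_slab_free(struct toy_slab *s, void *obj)
{
    toy_set_next(s, obj, s->freelist);
    s->freelist = obj;
    s->inuse--;
}
```

Because a free pushes onto the head of the list, the most recently freed object is the next one handed out. That LIFO behavior, together with the pointer sitting inside freed object memory, is why the freelist is interesting from a UAF point of view.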

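A slab itself holds little; kmem_cache does most of the heavy lifting.

The older slab allocator manages object allocation and freeing with an array of descriptors, and the slab itself also maintained ...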

Object

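Objects are what a slab hands out and what is actually returned to kernel code. Compared with a glibc chunk, though, an object's management and layout seem both simpler and more complicated.

Simpler, because no bin-style indexing is needed and there is none of the fd/bk machinery; everything is laid out linearly. At worst the partial list has to be walked; slub keeps a freelist pointer to an unallocated object, so taking one is almost a single step.

More complicated, because the kernel adds other (partly configurable) mechanisms, such as red zones that catch out-of-bounds reads and writes, and slab poisoning that fills objects with special bytes. Red zones occupy the spare space on both sides of an object (and are combined with alignment padding). Poisoning fills an object with 0x6b and terminates it with 0xa5, both when the object is freed and when its slab has just been taken from the buddy system. Additional tracking information is appended to the end of the object as well. (The original post shows the resulting object layout in a diagram.)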
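Poisoning is simple enough to sketch directly. The byte values below follow the kernel's include/linux/poison.h (POISON_FREE is 0x6b, POISON_END is 0xa5); the toy_* functions are illustrative and only mimic the idea of filling a freed object and checking the pattern later.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_POISON_FREE 0x6b  /* fill byte for freed object memory */
#define TOY_POISON_END  0xa5  /* last byte of a poisoned object */

/* Fill an object with the free pattern and mark its last byte. */
void toy_poison_object(unsigned char *obj, size_t object_size)
{
    assert(object_size > 0);
    memset(obj, TOY_POISON_FREE, object_size - 1);
    obj[object_size - 1] = TOY_POISON_END;
}

/* Return 1 if the pattern is intact, 0 if something wrote to the object. */
int toy_poison_intact(const unsigned char *obj, size_t object_size)
{
    assert(object_size > 0);
    for (size_t i = 0; i + 1 < object_size; i++) {
        if (obj[i] != TOY_POISON_FREE)
            return 0;
    }
    return obj[object_size - 1] == TOY_POISON_END;
}
```

A write through a dangling pointer changes at least one of these bytes, so checking the pattern when the object is allocated again reveals that it was modified while free. That is why a long run of 0x6b bytes in a crash dump usually points at freed slab memory.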

Allocation and freeing (simplified overview)

https://segmentfault.com/a/1190000043626203#item-6
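It is very well written; I am only restating it after working through it myself. It is hard to believe that someone with such deep understanding can also write and draw this clearly.

Allocation

Allocate directly from the CPU cache, taking from kmem_cache_cpu->slab. This is the fast path: if the freelist has a free object, it is taken and returned.
If the freelist is empty and kmem_cache_cpu->slab is full, allocate from the CPU cache's partial list. This is the slow path: the list is walked, and once a slab that can satisfy the request is found, it is promoted to the CPU cache and the allocation is made from it.
If partial is exhausted too, go back to the kmem_cache and walk the partial list of the node. A slab found there is likewise promoted to the CPU cache, and the remaining slabs are attached under kmem_cache_cpu->partial (up to a limit).
If everything is empty (for example right after creation), request pages from the buddy system according to the kmem_cache fields; the new slab is promoted straight to the CPU cache.
Once obtained, the slab goes through details such as carving into objects and slab poisoning, which are not covered here.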
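The order in which these sources are tried can be written down as a small decision function. This is a sketch of the description above, not of the kernel's actual allocation code; struct toy_cache_state reduces each level to a counter, and all names are made up.

```c
#include <assert.h>

enum toy_alloc_path {
    TOY_FAST_PATH,     /* free object on the per-CPU active slab */
    TOY_CPU_PARTIAL,   /* take a slab from the per-CPU partial list */
    TOY_NODE_PARTIAL,  /* take a slab from the node partial list */
    TOY_NEW_SLAB       /* ask the buddy system for fresh pages */
};

/* Hypothetical snapshot of one cache as seen from the current CPU. */
struct toy_cache_state {
    unsigned int cpu_slab_free;       /* free objects on kmem_cache_cpu->slab */
    unsigned int cpu_partial_slabs;   /* slabs on kmem_cache_cpu->partial */
    unsigned int node_partial_slabs;  /* slabs on kmem_cache_node->partial */
};

/* Pick the first level that can satisfy an allocation. */
enum toy_alloc_path toy_alloc_path_for(const struct toy_cache_state *c)
{
    if (c->cpu_slab_free > 0)
        return TOY_FAST_PATH;
    if (c->cpu_partial_slabs > 0)
        return TOY_CPU_PARTIAL;
    if (c->node_partial_slabs > 0)
        return TOY_NODE_PARTIAL;
    return TOY_NEW_SLAB;
}

/* Usage check: a freshly created cache has to go to the buddy system. */
void toy_alloc_path_demo(void)
{
    struct toy_cache_state fresh = { 0, 0, 0 };

    assert(toy_alloc_path_for(&fresh) == TOY_NEW_SLAB);
}
```

Each later level is more expensive than the one before it, which is why the per-CPU slab is checked first and the buddy system last.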

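Freeing

If the freed object belongs to kmem_cache_cpu->slab, it is put straight back into that slab and the freelist pointer is fixed up. This is the fast path.
If the freed object belongs to a slab on kmem_cache_cpu->partial, it is likewise put straight back, and the slab's freelist and the object's next pointer are updated.
If the freed object belongs to a slab on kmem_cache_node->partial, it is also put straight back.
If freeing the object turns a full slab into a partial one and the slab is not in the local CPU cache, the kernel inserts the slab into the partial list of the CPU cache.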

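The fact that the slab was full shows that it is a very active slab whose objects were in constant demand. Once an object is freed and it becomes a partial slab, this frequently used slab should go into the CPU cache to speed up the next allocation.

CPU cache space is precious, of course, so not everything can be stuffed into it. kmem_cache->cpu_partial sets a limit; when it is exceeded, all CPU partial slabs are moved to the partial list of the kmem_cache_node. This check comes first. There is a rationale for this as well:

When the CPU partial list overflows, the kernel is in a free-heavy phase: the kmem_cache_cpu->partial list is too full while allocation requests are few, so the slabs cached in kmem_cache_cpu are not being consumed quickly. All slabs on the list are therefore moved in one go to the partial list of the NUMA node cache, to be kept in reserve.

If freeing the object turns a slab from partial into empty, the kernel inserts the slab into the node cache, that is, kmem_cache_node->partial.

Since slub dropped slab's list of empty slabs and puts everything on partial, some extra handling is needed. kmem_cache->min_partial sets the limit on the number of slabs cached in a node. When partial exceeds it, empty slabs are returned to the buddy system. This check, too, runs before the insertion.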
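The free-side rules can be condensed into a decision function in the same spirit. It is a simplified model of the prose above under stated assumptions: the struct toy_free_ctx fields and action names are invented for the example, both limits are compared with >=, and the case where one free takes a slab from full straight to empty is excluded.

```c
#include <assert.h>

enum toy_free_action {
    TOY_FREE_FAST,                /* object belongs to the per-CPU active slab */
    TOY_FREE_IN_PLACE,            /* slab stays on the partial list it is on */
    TOY_MOVE_TO_CPU_PARTIAL,      /* full -> partial, CPU partial list has room */
    TOY_FLUSH_CPU_PARTIAL_FIRST,  /* full -> partial, CPU partial list at its limit */
    TOY_KEEP_ON_NODE_PARTIAL,     /* slab became empty, node below min_partial */
    TOY_RETURN_TO_BUDDY           /* slab became empty, node holds enough slabs */
};

/* Hypothetical description of one free operation. */
struct toy_free_ctx {
    int on_cpu_slab;                  /* object lives in kmem_cache_cpu->slab */
    int was_full;                     /* slab was full before this free */
    int now_empty;                    /* slab has no used objects after this free */
    unsigned int cpu_partial_slabs;   /* current length of the CPU partial list */
    unsigned int cpu_partial_limit;   /* stands in for kmem_cache->cpu_partial */
    unsigned int node_partial_slabs;  /* current length of the node partial list */
    unsigned int min_partial;         /* stands in for kmem_cache->min_partial */
};

/* Decide what happens to the slab that an object is freed into. */
enum toy_free_action toy_free_decide(const struct toy_free_ctx *c)
{
    /* Simplification: ignore slabs that hold a single object. */
    assert(!(c->was_full && c->now_empty));

    if (c->on_cpu_slab)
        return TOY_FREE_FAST;
    if (c->was_full)
        return c->cpu_partial_slabs >= c->cpu_partial_limit
                   ? TOY_FLUSH_CPU_PARTIAL_FIRST
                   : TOY_MOVE_TO_CPU_PARTIAL;
    if (c->now_empty)
        return c->node_partial_slabs >= c->min_partial
                   ? TOY_RETURN_TO_BUDDY
                   : TOY_KEEP_ON_NODE_PARTIAL;
    return TOY_FREE_IN_PLACE;
}
```

Both limit checks run before the slab is inserted anywhere, matching the two "this check comes first" remarks in the text.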

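Summary

That completes the picture of the slab/slub architecture: kmem_cache at the top, per-CPU and per-node slab lists beneath it, and objects at the bottom. (The original post sums this up in a diagram.)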
