Understanding Swap in Linux Kernel


Agenda

1. Background

2. Swap device initialization

3. Swap slot allocation

4. The relationship between the swap cache, pages, and swap slots

5. Swap out

6. Swap in

7. Reclaiming swap slots


1. Background

When free physical memory runs low, the system has to reclaim some memory that is currently in use. Reclaim concentrates on two kinds of user-process memory: the page cache and anonymous memory. The page cache holds the contents of files on storage devices, while anonymous memory backs the heap, the stack and similar data. When free memory is scarce, page-cache pages can simply be dropped or written back to the storage device, which frees memory. Anonymous memory, however, has no file on disk backing it, so a swap device is needed: the contents of anonymous pages are moved out to the swap device to free up memory. A swap device can be either a file or a disk partition. This article tries to explain how swap is managed. Since swap is one part of memory reclaim, introductions to the reclaim subsystem can be found in earlier posts: 认识物理内存回收机制 (an introduction to physical memory reclaim) and How to swap out the anonymous page?. In fact, besides anonymous memory, two other kinds of pages are handled by swap: a) dirty pages that belong to a private memory mapping of a process; b) pages that belong to an IPC shared memory region [1].

After a page has been swapped out to a swap device, the page table entry for the corresponding virtual address is rewritten to describe where the contents now live. On a 32-bit system this value has 32 bits: bits 2-6 (the upper 5 of the low 7 bits) identify which swap device the page was written to, and bits 7-31 hold the offset within that device. In the common case where a physical page is 4KB and a disk block is also 4KB, only the high-order bits (bits 11-31) end up being used, because of the 4KB alignment. When the MMU reads such a page table entry and finds bit 0 (the present bit) clear, it raises a fault; the kernel handles the fault, sees that the remaining 31 bits are not all zero, concludes that the faulting page's contents are on swap, and performs a swap-in.
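
To make the bit layout concrete, here is a minimal user-space sketch, an assumption-laden simplification rather than the kernel's arch-specific encoding, that packs and unpacks a swapped-out PTE as described above. The helper names echo the kernel's swp_entry()/swp_type()/swp_offset() but are toy versions.

#include <stdio.h>
#include <stdint.h>

/* Toy encoding of a swapped-out PTE on a 32-bit system:
 * bit 0    : present bit, 0 for a swapped-out page
 * bits 2-6 : swap type (which swap device), 5 bits -> up to 32 devices
 * bits 7-31: offset of the slot inside that swap device
 */
static uint32_t swp_entry(uint32_t type, uint32_t offset)
{
        return ((type & 0x1f) << 2) | (offset << 7);
}

static uint32_t swp_type(uint32_t entry)
{
        return (entry >> 2) & 0x1f;
}

static uint32_t swp_offset(uint32_t entry)
{
        return entry >> 7;
}

int main(void)
{
        uint32_t pte = swp_entry(3, 0x1234);    /* device 3, slot 0x1234 */

        if (!(pte & 1) && pte)                  /* not present, but non-zero */
                printf("swapped out: type=%u offset=0x%x\n",
                       swp_type(pte), swp_offset(pte));
        return 0;
}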

When space on the swap device becomes tight (i.e. a certain amount of anonymous memory has already been swapped out), swap slots need to be reclaimed to make room for further swap-out. Reclaim mainly targets slots whose contents were swapped out earlier and later swapped back in by some process, but whose swap space was never released.

2. Swap Device Initialization

A swap device can be a disk partition or a regular file; either is enabled via the swapon system call. For a file, the disk blocks backing it are not necessarily contiguous, so managing a file-backed swap area is a little more complicated. Each swap device is described by a swap_info_struct; by default the system supports up to 32 swap devices. Initializing a swap device essentially boils down to initializing its swap_info_struct.

Since the system can have several swap devices, which one is chosen when a swap-out is performed? Each swap device is assigned a priority; the larger the value, the higher the priority. At swap-out time the device with the highest priority is chosen.
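
As an illustration only, the selection rule described above could look like the sketch below: walk the registered areas and pick the writable one with the largest priority value. struct toy_swap_info and pick_swap_area are invented stand-ins for swap_info_struct and the kernel's swap_list walk.

#include <stdio.h>

struct toy_swap_info {
        const char *name;
        int prio;          /* larger value = higher priority */
        int writeok;       /* plays the role of SWP_WRITEOK  */
};

/* Return the index of the highest-priority writable swap area, or -1. */
static int pick_swap_area(struct toy_swap_info *si, int n)
{
        int best = -1, i;

        for (i = 0; i < n; i++) {
                if (!si[i].writeok)
                        continue;
                if (best < 0 || si[i].prio > si[best].prio)
                        best = i;
        }
        return best;
}

int main(void)
{
        struct toy_swap_info areas[] = {
                { "/dev/sda2", -1, 1 },
                { "/swapfile",  5, 1 },
                { "/dev/sdb1", 10, 0 },   /* not writable, skipped */
        };
        int idx = pick_swap_area(areas, 3);

        if (idx >= 0)
                printf("swap out to %s (prio %d)\n",
                       areas[idx].name, areas[idx].prio);
        return 0;
}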

The data structure that represents a swap device in the kernel is:

178 /*
179  * The in-memory structure used to track swap areas.
180  */
181 struct swap_info_struct {
182         unsigned long   flags;          /* SWP_USED etc: see above */
183         signed short    prio;           /* swap priority of this type */
184         signed char     type;           /* strange name for an index */
185         signed char     next;           /* next type on the swap list */
186         unsigned int    max;            /* extent of the swap_map */
187         unsigned char *swap_map;        /* vmalloc'ed array of usage counts */
188         unsigned int lowest_bit;        /* index of first free in swap_map */
189         unsigned int highest_bit;       /* index of last free in swap_map */
190         unsigned int pages;             /* total of usable pages of swap */
191         unsigned int inuse_pages;       /* number of those currently in use */
192         unsigned int cluster_next;      /* likely index for next allocation */
193         unsigned int cluster_nr;        /* countdown to next cluster search */
194         unsigned int lowest_alloc;      /* while preparing discard cluster */
195         unsigned int highest_alloc;     /* while preparing discard cluster */
196         struct swap_extent *curr_swap_extent;
197         struct swap_extent first_swap_extent;
198         struct block_device *bdev;      /* swap device or bdev of swap file */
199         struct file *swap_file;         /* seldom referenced */
200         unsigned int old_block_size;    /* seldom referenced */
201 #ifdef CONFIG_FRONTSWAP
202         unsigned long *frontswap_map;   /* frontswap in-use, one bit per page */
203         atomic_t frontswap_pages;       /* frontswap pages in-use counter */
204 #endif
205         spinlock_t lock;                /*
206                                          * protect map scan related fields like
207                                          * swap_map, lowest_bit, highest_bit,
208                                          * inuse_pages, cluster_next,
209                                          * cluster_nr, lowest_alloc and
210                                          * highest_alloc. other fields are only
211                                          * changed at swapon/swapoff, so are
212                                          * protected by swap_lock. changing
213                                          * flags need hold this lock and
214                                          * swap_lock. If both locks need hold,
215                                          * hold swap_lock first.
216                                          */
217 };

Each member of this structure is explained by its comment. One field worth highlighting is flags, which describes various attributes of the swap area. The flag values are as follows:

146 enum {
147         SWP_USED        = (1 << 0),     /* is slot in swap_info[] used? */
148         SWP_WRITEOK     = (1 << 1),     /* ok to write to this swap?    */
149         SWP_DISCARDABLE = (1 << 2),     /* swapon+blkdev support discard */
150         SWP_DISCARDING  = (1 << 3),     /* now discarding a free cluster */
151         SWP_SOLIDSTATE  = (1 << 4),     /* blkdev seeks are cheap */
152         SWP_CONTINUED   = (1 << 5),     /* swap_map has count continuation */
153         SWP_BLKDEV      = (1 << 6),     /* its a block device */
154         SWP_FILE        = (1 << 7),     /* set after swap_activate success */
155                                         /* add others here before... */
156         SWP_SCANNING    = (1 << 8),     /* refcount in scan_swap_map */
157 };

SWP_USED indicates that this entry in swap_info[] is in use, i.e. the swap device is available.

SWP_WRITEOK indicates that the swap device can be written to; every swap device becomes writable once it has been enabled.

SWP_DISCARDABLE indicates whether the device supports discard (see the --discard option in man swapon and reference [2]). It is normally only relevant for SSDs, where performance may improve if the SSD supports TRIM/discard.

SWP_SOLIDSTATE is set for a swap file or partition on an SSD. It is consulted when searching for a free swap page.

For the use of SWP_CONTINUED, see the description in the section on swap reclaim.

SWP_BLKDEV and SWP_FILE indicate whether the swap area is a block device (partition) or a file.

SWP_SCANNING is set while the system is scanning a swap area for a free page slot, and cleared when the scan finishes.

In addition, swap_map is an array with one byte per swap slot, used as a usage count for that slot.

3. Swap Slot Allocation

When an anonymous page needs to be swapped out, a free slot must first be allocated on a swap device. Allocation takes roughly two steps: first locate a swap device, then find a free slot on that device.

To locate a swap device, the kernel defines a swap_list structure with two members, head and next. The search for a swap device always starts from swap_list.next.

416 swp_entry_t get_swap_page(void)
417 {
418         struct swap_info_struct *si;
419         pgoff_t offset;
420         int type, next;
421         int wrapped = 0;
422         int hp_index;
423
424         spin_lock(&swap_lock);
425         if (atomic_long_read(&nr_swap_pages) <= 0)
426                 goto noswap;
427         atomic_long_dec(&nr_swap_pages);//Available Pages in swap device.
428
429         for (type = swap_list.next; type >= 0 && wrapped < 2; type = next) {
430                 hp_index = atomic_xchg(&highest_priority_index, -1);//Store -1, and return previous value.
431                 /*
432                  * highest_priority_index records current highest priority swap
433                  * type which just frees swap entries. If its priority is
434                  * higher than that of swap_list.next swap type, we use it.  It
435                  * isn't protected by swap_lock, so it can be an invalid value
436                  * if the corresponding swap type is swapoff. We double check
437                  * the flags here. It's even possible the swap type is swapoff
438                  * and swapon again and its priority is changed. In such rare
439                  * case, low prority swap type might be used, but eventually
440                  * high priority swap will be used after several rounds of
441                  * swap.
442                  */
443                 if (hp_index != -1 && hp_index != type &&
444                     swap_info[type]->prio < swap_info[hp_index]->prio &&
445                     (swap_info[hp_index]->flags & SWP_WRITEOK)) {
446                         type = hp_index;
447                         swap_list.next = type;
448                 }
449
450                 si = swap_info[type];
451                 next = si->next;
452                 if (next < 0 ||
453                     (!wrapped && si->prio != swap_info[next]->prio)) {
454                         next = swap_list.head;
455                         wrapped++;
456                 }
457
458                 spin_lock(&si->lock);
459                 if (!si->highest_bit) {
460                         spin_unlock(&si->lock);
461                         continue;
462                 }
463                 if (!(si->flags & SWP_WRITEOK)) {
464                         spin_unlock(&si->lock);
465                         continue;
466                 }
467
468                 swap_list.next = next;
469
470                 spin_unlock(&swap_lock);
471                 /* This is called for allocating swap entry for cache */
472                 offset = scan_swap_map(si, SWAP_HAS_CACHE);
473                 spin_unlock(&si->lock);
474                 if (offset)
475                         return swp_entry(type, offset);
476                 spin_lock(&swap_lock);
477                 next = swap_list.next;
478         }
479
480         atomic_long_inc(&nr_swap_pages);
481 noswap:
482         spin_unlock(&swap_lock);
483         return (swp_entry_t) {0};
484 }

The first step is to pick a suitable swap file or swap device. All available swap areas are enumerated in the swap_info[] array, and the index of the area to try first is recorded in swap_list.next; that is why the for loop at line 429 starts from swap_list.next. There is also a global variable, highest_priority_index, which records the highest-priority swap area that has recently freed swap entries. So which should be chosen, the area at swap_list.next or the one at highest_priority_index? Simply whichever has the higher priority (see the comparison at lines 443-445).

Next, the chosen swap area is validated further: it must still have space to allocate (lines 459-462) and must be writable (lines 463-466). swap_list.next is also updated so that allocations are spread as evenly as possible across swap areas.

Once a swap area has been selected, scan_swap_map allocates a free slot on it. Its usage parameter is set to SWAP_HAS_CACHE, meaning the physical page corresponding to this swap slot also exists in memory, in the swap address_space; if the slot is no longer referenced by the address_space, this flag is cleared.

get_swap_page-> scan_swap_map:

187 static unsigned long scan_swap_map(struct swap_info_struct *si,
188                                    unsigned char usage)
189 {
190         unsigned long offset;
191         unsigned long scan_base;
192         unsigned long last_in_cluster = 0;
193         int latency_ration = LATENCY_LIMIT;
194         int found_free_cluster = 0;
195
196         /*
197          * We try to cluster swap pages by allocating them sequentially
198          * in swap.  Once we've allocated SWAPFILE_CLUSTER pages this
199          * way, however, we resort to first-free allocation, starting
200          * a new cluster.  This prevents us from scattering swap pages
201          * all over the entire swap partition, so that we reduce
202          * overall disk seek times between swap pages.  -- sct
203          * But we do now try to find an empty cluster.  -Andrea
204          * And we let swap pages go all over an SSD partition.  Hugh
205          */
206
207         si->flags += SWP_SCANNING;
208         scan_base = offset = si->cluster_next;
209
210         if (unlikely(!si->cluster_nr--)) {
211                 if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
212                         si->cluster_nr = SWAPFILE_CLUSTER - 1;
213                         goto checks;
214                 }
215                 if (si->flags & SWP_DISCARDABLE) {
216                         /*
217                          * Start range check on racing allocations, in case
218                          * they overlap the cluster we eventually decide on
219                          * (we scan without swap_lock to allow preemption).
220                          * It's hardly conceivable that cluster_nr could be
221                          * wrapped during our scan, but don't depend on it.
222                          */
223                         if (si->lowest_alloc)
224                                 goto checks;
225                         si->lowest_alloc = si->max;
226                         si->highest_alloc = 0;
227                 }
228                 spin_unlock(&si->lock);
229
230                 /*
231                  * If seek is expensive, start searching for new cluster from
232                  * start of partition, to minimize the span of allocated swap.
233                  * But if seek is cheap, search from our current position, so
234                  * that swap is allocated from all over the partition: if the
235                  * Flash Translation Layer only remaps within limited zones,
236                  * we don't want to wear out the first zone too quickly.
237                  */
238                 if (!(si->flags & SWP_SOLIDSTATE))
239                         scan_base = offset = si->lowest_bit;
240                 last_in_cluster = offset + SWAPFILE_CLUSTER - 1;
241
242                 /* Locate the first empty (unaligned) cluster */
243                 for (; last_in_cluster <= si->highest_bit; offset++) {
244                         if (si->swap_map[offset])
245                                 last_in_cluster = offset + SWAPFILE_CLUSTER;
246                         else if (offset == last_in_cluster) {
247                                 spin_lock(&si->lock);
248                                 offset -= SWAPFILE_CLUSTER - 1;
249                                 si->cluster_next = offset;
250                                 si->cluster_nr = SWAPFILE_CLUSTER - 1;
251                                 found_free_cluster = 1;
252                                 goto checks;
253                         }
254                         if (unlikely(--latency_ration < 0)) {
255                                 cond_resched();
256                                 latency_ration = LATENCY_LIMIT;
257                         }
258                 }
259
260                 offset = si->lowest_bit;
261                 last_in_cluster = offset + SWAPFILE_CLUSTER - 1;
262
263                 /* Locate the first empty (unaligned) cluster */
264                 for (; last_in_cluster < scan_base; offset++) {
265                         if (si->swap_map[offset])
266                                 last_in_cluster = offset + SWAPFILE_CLUSTER;
267                         else if (offset == last_in_cluster) {
268                                 spin_lock(&si->lock);
269                                 offset -= SWAPFILE_CLUSTER - 1;
270                                 si->cluster_next = offset;
271                                 si->cluster_nr = SWAPFILE_CLUSTER - 1;
272                                 found_free_cluster = 1;
273                                 goto checks;
274                         }
275                         if (unlikely(--latency_ration < 0)) {
276                                 cond_resched();
277                                 latency_ration = LATENCY_LIMIT;
278                         }
279                 }
280
281                 offset = scan_base;
282                 spin_lock(&si->lock);
283                 si->cluster_nr = SWAPFILE_CLUSTER - 1;
284                 si->lowest_alloc = 0;
285         }
286
287 checks:
288         if (!(si->flags & SWP_WRITEOK))
289                 goto no_page;
290         if (!si->highest_bit)
291                 goto no_page;
292         if (offset > si->highest_bit)
293                 scan_base = offset = si->lowest_bit;
294
295         /* reuse swap entry of cache-only swap if not busy. */
296         if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
297                 int swap_was_freed;
298                 spin_unlock(&si->lock);
299                 swap_was_freed = __try_to_reclaim_swap(si, offset);
300                 spin_lock(&si->lock);
301                 /* entry was freed successfully, try to use this again */
302                 if (swap_was_freed)
303                         goto checks;
304                 goto scan; /* check next one */
305         }
306
307         if (si->swap_map[offset])
308                 goto scan;
309
310         if (offset == si->lowest_bit)
311                 si->lowest_bit++;
312         if (offset == si->highest_bit)
313                 si->highest_bit--;
314         si->inuse_pages++;
315         if (si->inuse_pages == si->pages) {
316                 si->lowest_bit = si->max;
317                 si->highest_bit = 0;
318         }
319         si->swap_map[offset] = usage;
320         si->cluster_next = offset + 1;
321         si->flags -= SWP_SCANNING;
322
323         if (si->lowest_alloc) {
324                 /*
325                  * Only set when SWP_DISCARDABLE, and there's a scan
326                  * for a free cluster in progress or just completed.
327                  */
328                 if (found_free_cluster) {
329                         /*
330                          * To optimize wear-levelling, discard the
331                          * old data of the cluster, taking care not to
332                          * discard any of its pages that have already
333                          * been allocated by racing tasks (offset has
334                          * already stepped over any at the beginning).
335                          */
336                         if (offset < si->highest_alloc &&
337                             si->lowest_alloc <= last_in_cluster)
338                                 last_in_cluster = si->lowest_alloc - 1;
339                         si->flags |= SWP_DISCARDING;
340                         spin_unlock(&si->lock);
341
342                         if (offset < last_in_cluster)
343                                 discard_swap_cluster(si, offset,
344                                         last_in_cluster - offset + 1);
345
346                         spin_lock(&si->lock);
347                         si->lowest_alloc = 0;
348                         si->flags &= ~SWP_DISCARDING;
349
350                         smp_mb();       /* wake_up_bit advises this */
351                         wake_up_bit(&si->flags, ilog2(SWP_DISCARDING));
352
353                 } else if (si->flags & SWP_DISCARDING) {
354                         /*
355                          * Delay using pages allocated by racing tasks
356                          * until the whole discard has been issued. We
357                          * could defer that delay until swap_writepage,
358                          * but it's easier to keep this self-contained.
359                          */
360                         spin_unlock(&si->lock);
361                         wait_on_bit(&si->flags, ilog2(SWP_DISCARDING),
362                                 wait_for_discard, TASK_UNINTERRUPTIBLE);
363                         spin_lock(&si->lock);
364                 } else {
365                         /*
366                          * Note pages allocated by racing tasks while
367                          * scan for a free cluster is in progress, so
368                          * that its final discard can exclude them.
369                          */
370                         if (offset < si->lowest_alloc)
371                                 si->lowest_alloc = offset;
372                         if (offset > si->highest_alloc)
373                                 si->highest_alloc = offset;
374                 }
375         }
376         return offset;
377
378 scan:
379         spin_unlock(&si->lock);
380         while (++offset <= si->highest_bit) {
381                 if (!si->swap_map[offset]) {
382                         spin_lock(&si->lock);
383                         goto checks;
384                 }
385                 if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
386                         spin_lock(&si->lock);
387                         goto checks;
388                 }
389                 if (unlikely(--latency_ration < 0)) {
390                         cond_resched();
391                         latency_ration = LATENCY_LIMIT;
392                 }
393         }
394         offset = si->lowest_bit;
395         while (++offset < scan_base) {
396                 if (!si->swap_map[offset]) {
397                         spin_lock(&si->lock);
398                         goto checks;
399                 }
400                 if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
401                         spin_lock(&si->lock);
402                         goto checks;
403                 }
404                 if (unlikely(--latency_ration < 0)) {
405                         cond_resched();
406                         latency_ration = LATENCY_LIMIT;
407                 }
408         }
409         spin_lock(&si->lock);
410
411 no_page:
412         si->flags -= SWP_SCANNING;
413         return 0;
414 }

The search for a free slot takes place within a cluster; each cluster contains SWAPFILE_CLUSTER (256) slots. Once all the slots in the current cluster have been allocated, a new empty cluster has to be found; lines 210-285 implement that search. Lines 243-258 scan up to highest_bit; if no empty cluster is found there, the scan restarts from lowest_bit (lines 263-279). If that also fails, the code falls through to the scan label and sweeps the area from start to end (lines 380-393), where a slot held only by the swap cache (SWAP_HAS_CACHE, line 385) may be reclaimed and reused.

Lines 243-258 try to find a cluster whose slots are all free, so that subsequent allocations stay within that cluster. Lines 323-376 add an optimization specifically for SSDs that support TRIM: once an empty cluster has been found, a discard is issued for it.

In short, whenever an entry in swap_map is zero, the corresponding slot is available for allocation.
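
The idea behind the cluster search can be modelled in user space with swap_map as a plain byte array: look for SWAPFILE_CLUSTER consecutive zero bytes, and fall back to the first free slot if no empty cluster exists. This is only a sketch of the policy, far simpler than scan_swap_map above (no locking, no SSD or discard handling).

#include <stdio.h>
#include <string.h>

#define NSLOTS   4096
#define CLUSTER  256            /* plays the role of SWAPFILE_CLUSTER */

static unsigned char swap_map[NSLOTS];   /* 0 means the slot is free */

/* Find the first slot of a fully free cluster, or the first free slot,
 * or -1 if the area is full. */
static int toy_scan_swap_map(void)
{
        int run = 0, i;

        for (i = 0; i < NSLOTS; i++) {
                run = swap_map[i] ? 0 : run + 1;
                if (run == CLUSTER)
                        return i - CLUSTER + 1;   /* empty cluster found */
        }
        for (i = 0; i < NSLOTS; i++)              /* fallback: first free slot */
                if (!swap_map[i])
                        return i;
        return -1;
}

int main(void)
{
        memset(swap_map, 1, 300);        /* pretend the first 300 slots are in use */
        int slot = toy_scan_swap_map();

        if (slot >= 0) {
                swap_map[slot] = 1;      /* mark it allocated */
                printf("allocated slot %d\n", slot);
        }
        return 0;
}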

4. The Relationship Between the Swap Cache, Swap Slots, and Pages

The swap cache is not really a cache; it is more of a bookkeeping mechanism. Why is it needed? Consider this case: a physical page referenced by several processes gets swapped out, so the page table entries for that page in all of those processes are rewritten to point to the swap slot. Later, one process touches the swapped-out page, a new physical page is allocated, and the contents are copied in from swap. At that moment the page table entries of the other processes still hold the swap slot address. What happens when another of those processes needs the contents of that slot? It should, of course, share the physical page that has already been swapped in, and the reference count of the swap slot is simply decremented by one. But how do the other processes know that the slot's contents already have a copy in physical memory? That is exactly what the swap cache provides.

For a more detailed explanation, see Understanding the Linux Kernel, 3rd edition, section 17.4.6, "The Swap Cache".
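
A minimal way to picture the swap cache is as a lookup table keyed by swap entry: before reading a slot from disk, a process checks the table; on a hit, it shares the page that is already in memory and no I/O is issued. The sketch below uses a plain array indexed by swap offset instead of the kernel's radix tree, and the toy_* names are invented for illustration.

#include <stdio.h>

#define NSLOTS 1024

struct page { int data; };

/* One swap-cache "tree" per swap area; index = swap offset. */
static struct page *swap_cache[NSLOTS];

static struct page *toy_lookup_swap_cache(unsigned offset)
{
        return offset < NSLOTS ? swap_cache[offset] : NULL;
}

static void toy_add_to_swap_cache(struct page *page, unsigned offset)
{
        if (offset < NSLOTS)
                swap_cache[offset] = page;
}

int main(void)
{
        struct page p = { 42 };

        toy_add_to_swap_cache(&p, 7);          /* process A swapped slot 7 in */

        struct page *hit = toy_lookup_swap_cache(7);   /* process B faults on slot 7 */
        if (hit)
                printf("reuse in-memory copy, data=%d\n", hit->data);
        else
                printf("miss: read slot 7 from the swap device\n");
        return 0;
}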

5. Swap Out

Swapping out means that, during memory reclaim, anonymous pages that will not be needed for a while are released to obtain more free physical memory, with their contents parked on a swap device/file. Swap-out therefore involves two steps: first find anonymous pages that are not currently needed, then write those pages out.

Finding candidate anonymous pages uses an approximate LRU algorithm; see the earlier posts analysing memory reclaim for the details.

To swap out an anonymous page, a free area must first be found on the swap device/file; the page is then marked dirty and queued on the block I/O path so that the disk driver writes it out. Finding free space was covered in the section on slot allocation above, and writing page contents to disk is ordinary block I/O, so neither is analysed further here. Two points deserve attention. First, if a page is referenced by several processes, how do we guarantee at write-out time that it is written only once, and to a single swap entry? Second, during swap-in, if process A has already read the page from swap into memory while process B's page table entry still points to the swap entry, how do we make sure B does not read the page in again? Both problems are solved by the address_space: it is a radix tree, and every swap device/file has one. In this tree the key is the swap entry and the value is a pointer to the physical page. For the relationship between swap entries, physical pages and the address_space, see section 4 above.

Once a free swap slot has been found, the swap-out proceeds; the page is first added to the swap cache:

 77 /*
 78  * __add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
 79  * but sets SwapCache flag and private instead of mapping and index.
 80  */
 81 int __add_to_swap_cache(struct page *page, swp_entry_t entry)
 82 {
 83         int error;
 84         struct address_space *address_space;
 85
 86         VM_BUG_ON(!PageLocked(page));
 87         VM_BUG_ON(PageSwapCache(page));
 88         VM_BUG_ON(!PageSwapBacked(page));
 89
 90         page_cache_get(page);
 91         SetPageSwapCache(page);
 92         set_page_private(page, entry.val);
 93
 94         address_space = swap_address_space(entry);
 95         spin_lock_irq(&address_space->tree_lock);
 96         error = radix_tree_insert(&address_space->page_tree,
 97                                         entry.val, page);
 98         if (likely(!error)) {
 99                 address_space->nrpages++;
100                 __inc_zone_page_state(page, NR_FILE_PAGES);
101                 INC_CACHE_INFO(add_total);
102         }
103         spin_unlock_irq(&address_space->tree_lock);
104
105         if (unlikely(error)) {
106                 /*
107                  * Only the context which have set SWAP_HAS_CACHE flag
108                  * would call add_to_swap_cache().
109                  * So add_to_swap_cache() doesn't returns -EEXIST.
110                  */
111                 VM_BUG_ON(error == -EEXIST);
112                 set_page_private(page, 0UL);
113                 ClearPageSwapCache(page);
114                 page_cache_release(page);
115         }
116
117         return error;
118 }

For a page being swapped out, its reference count is first incremented (line 90), the PG_swapcache flag is set, and page->private is set to the value of the swap entry (the offset within the swap disk/file). Finally the page is inserted into the radix tree of the address_space.

6. Swap In

After one of a process's physical pages has been swapped out to the swap area, accessing it again triggers a page fault. If the fault handler finds that the page table entry is non-zero but not valid (not present), it concludes that a swap-in is required. The handler for this case is do_swap_page.

handle_pte_fault->do_swap_page:

2994 /*
2995  * We enter with non-exclusive mmap_sem (to exclude vma changes,
2996  * but allow concurrent faults), and pte mapped but not yet locked.
2997  * We return with mmap_sem still held, but pte unmapped and unlocked.
2998  */
2999 static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
3000                 unsigned long address, pte_t *page_table, pmd_t *pmd,
3001                 unsigned int flags, pte_t orig_pte)
3002 {
3003         spinlock_t *ptl;
3004         struct page *page, *swapcache;
3005         swp_entry_t entry;
3006         pte_t pte;
3007         int locked;
3008         struct mem_cgroup *ptr;
3009         int exclusive = 0;
3010         int ret = 0;
3011 
3012         if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
3013                 goto out;
3014 
3015         entry = pte_to_swp_entry(orig_pte);
3016         if (unlikely(non_swap_entry(entry))) {
3017                 if (is_migration_entry(entry)) {
3018                         migration_entry_wait(mm, pmd, address);
3019                 } else if (is_hwpoison_entry(entry)) {
3020                         ret = VM_FAULT_HWPOISON;
3021                 } else {
3022                         print_bad_pte(vma, address, orig_pte, NULL);
3023                         ret = VM_FAULT_SIGBUS;
3024                 }
3025                 goto out;
3026         }
3027         delayacct_set_flag(DELAYACCT_PF_SWAPIN);
3028         page = lookup_swap_cache(entry);
3029         if (!page) {
3030                 page = swapin_readahead(entry, 
3031                                         GFP_HIGHUSER_MOVABLE, vma, address);
…...
3117         pte = mk_pte(page, vma->vm_page_prot);
3118         if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
3119                 pte = maybe_mkwrite(pte_mkdirty(pte), vma);
3120                 flags &= ~FAULT_FLAG_WRITE;
3121                 ret |= VM_FAULT_WRITE;
3122                 exclusive = 1;
3123         }
3124         flush_icache_page(vma, page);
3125         if (pte_swp_soft_dirty(orig_pte))
3126                 pte = pte_mksoft_dirty(pte);
3127         set_pte_at(mm, address, page_table, pte);
3128         if (page == swapcache)
3129                 do_page_add_anon_rmap(page, vma, address, exclusive);
3130         else /* ksm created a completely new copy */
3131                 page_add_new_anon_rmap(page, vma, address);
3132         /* It's better to call commit-charge after rmap is established */
3133         mem_cgroup_commit_charge_swapin(page, ptr);
3134 
3135         swap_free(entry);
3136         if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
3137                 try_to_free_swap(page);
3138         unlock_page(page);
3139         if (page != swapcache) {
3140                 /*
3141                  * Hold the lock to avoid the swap entry to be reused
3142                  * until we take the PT lock for the pte_same() check
3143                  * (to avoid false positives from pte_same). For
3144                  * further safety release the lock after the swap_free
3145                  * so that the swap count won't change under a
3146                  * parallel locked swapcache.
3147                  */
3148                 unlock_page(swapcache);
3149                 page_cache_release(swapcache);
3150         }
3151 
3152         if (flags & FAULT_FLAG_WRITE) {
3153                 ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
3154                 if (ret & VM_FAULT_ERROR)
3155                         ret &= VM_FAULT_ERROR;
3156                 goto out;
3157         }
3158 
3159         /* No need to invalidate - it was non-present before */
3160         update_mmu_cache(vma, address, page_table);
3161 unlock:
3162         pte_unmap_unlock(page_table, ptl);
3163 out:
3164         return ret;
3165 out_nomap:
3166         mem_cgroup_cancel_charge_swapin(ptr);
3167         pte_unmap_unlock(page_table, ptl);
3168 out_page:
3169         unlock_page(page);
3170 out_release:
3171         page_cache_release(page);
3172         if (page != swapcache) {
3173                 unlock_page(swapcache);
3174                 page_cache_release(swapcache);
3175         }
3176         return ret;
3177 }

Line 3012 avoids redundant work: if the PTE has already been handled by another process or another core, there is nothing left to do.

Line 3015 derives the swap slot (swap entry) from the PTE.

Lines 3016-3026 perform error handling and other checks.

Line 3027 sets the delay-accounting flag in the thread's struct task_struct (task_delay_info.flags) to record that a swap-in is in progress; the matching calls later in the function (around lines 3054-3062, in the part elided above) clear the flag again to indicate the swap-in has finished.

Because a physical page may be referenced by several processes, if another process has already copied it back into memory, i.e. the page can be found in the swap cache of that swap area (line 3028), there is no need to read it in again. If it is not in the swap cache, the storage controller is driven to read the corresponding contents from the swap area into physical memory (line 3030). Once this is done, the page table is updated (lines 3117 and 3127), and the reference count of the corresponding swap slot is maintained (line 3135).
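
The order of operations in do_swap_page can be mimicked with a small model: look the entry up in the swap cache, fall back to reading the "device", remap the PTE, then drop one reference on the slot. Everything below (the arrays, toy_do_swap_page) is an invented simplification of the path above, not kernel code.

#include <stdio.h>

#define NSLOTS 256

/* Toy state for one swap area. */
static unsigned char swap_map[NSLOTS];      /* reference counts per slot        */
static int swap_cache[NSLOTS];              /* -1 = not cached, else page id    */
static int swap_backing[NSLOTS];            /* pretend on-disk contents         */

static int toy_do_swap_page(unsigned offset, int *pte)
{
        int page;

        if (swap_cache[offset] >= 0) {
                page = swap_cache[offset];   /* like lookup_swap_cache()         */
        } else {
                page = swap_backing[offset]; /* like swapin_readahead(): do I/O  */
                swap_cache[offset] = page;
        }

        *pte = page;                 /* like set_pte_at(): map the page again    */
        if (swap_map[offset])
                swap_map[offset]--;  /* like swap_free(): drop one reference     */
        return page;
}

int main(void)
{
        int pte_a = -1, pte_b = -1;
        unsigned off = 5;

        for (int i = 0; i < NSLOTS; i++)
                swap_cache[i] = -1;

        swap_backing[off] = 1234;    /* page contents parked on "disk"           */
        swap_map[off] = 2;           /* two processes reference this slot        */

        toy_do_swap_page(off, &pte_a);   /* first fault: reads from the "disk"   */
        toy_do_swap_page(off, &pte_b);   /* second fault: hits the swap cache    */
        printf("pte_a=%d pte_b=%d remaining refs=%d\n",
               pte_a, pte_b, swap_map[off]);
        return 0;
}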

7. Reclaiming Swap

When the swap disk/file gets rather full, some space needs to be reclaimed. The criterion for "full" is:

/* Swap 50% full? Release swapcache more aggressively.. */
408 static inline bool vm_swap_full(void)
409 {
410         return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages;
411 }

As you can see, once 50% of the swap space is in use, the swap area is considered in need of reclaim. The question then is: which allocated swap entries can be reclaimed? An entry can be reclaimed if it satisfies all of the conditions below (a small sketch of the resulting test follows the list):

1. The swap entry exists in the swap cache (PG_swapcache is set), i.e. physical memory also holds a copy of the entry's contents.

2. The page is not under writeback (PG_writeback is not set), which means the in-memory copy was swapped in earlier rather than being in the middle of being swapped out.

3. The reference count of the swap entry is 0, i.e. no process references it any more. This may seem to contradict the first condition; note that before the swap space is freed, the swap entry has already been removed from the address_space, yet PG_swapcache is still set.
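
Put together, the three conditions form a simple predicate. The sketch below merely restates the rules; the field names mimic the page flags mentioned above and struct toy_swap_page is invented for illustration.

#include <stdio.h>
#include <stdbool.h>

struct toy_swap_page {
        bool pg_swapcache;         /* PG_swapcache: copy exists in memory   */
        bool pg_writeback;         /* PG_writeback: write-out still running */
        unsigned char swap_count;  /* references from page tables           */
};

/* Can the swap slot behind this page be given back to the swap area? */
static bool toy_can_reclaim_swap_slot(const struct toy_swap_page *p)
{
        return p->pg_swapcache && !p->pg_writeback && p->swap_count == 0;
}

int main(void)
{
        struct toy_swap_page swapped_in    = { true, false, 0 };  /* reclaimable */
        struct toy_swap_page being_written = { true, true,  0 };  /* not yet     */

        printf("%d %d\n", toy_can_reclaim_swap_slot(&swapped_in),
                          toy_can_reclaim_swap_slot(&being_written));
        return 0;
}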

After a swap entry has been removed from the address_space (that is, from the swap cache), swapcache_free is called:

862 /*
 863  * Called after dropping swapcache to decrease refcnt to swap entries.
 864  */
 865 void swapcache_free(swp_entry_t entry, struct page *page)
 866 {
 867         struct swap_info_struct *p;
 868         unsigned char count;
 869
 870         p = swap_info_get(entry);
 871         if (p) {
 872                 count = swap_entry_free(p, entry, SWAP_HAS_CACHE);
 873                 if (page)
 874                         mem_cgroup_uncharge_swapcache(page, entry, count != 0);
 875                 spin_unlock(&p->lock);
 876         }
 877 }

swapcache_free->swap_entry_free:

The swap entry has already been deleted from the swap cache's radix tree, so the next step is to drop the entry's count in swap_map[]. Because the entry used to be in the swap cache, the SWAP_HAS_CACHE bit is set in its swap_map[] count, which is why swapcache_free passes SWAP_HAS_CACHE to swap_entry_free. As the code below shows, when SWAP_HAS_CACHE is passed in, the SWAP_HAS_CACHE bit of the entry's swap_map count must indeed be set (lines 800-803).

789 static unsigned char swap_entry_free(struct swap_info_struct *p,
 790                                      swp_entry_t entry, unsigned char usage)
 791 {
 792         unsigned long offset = swp_offset(entry);
 793         unsigned char count;
 794         unsigned char has_cache;
 795
 796         count = p->swap_map[offset];
 797         has_cache = count & SWAP_HAS_CACHE;
 798         count &= ~SWAP_HAS_CACHE;
 799
 800         if (usage == SWAP_HAS_CACHE) {
 801                 VM_BUG_ON(!has_cache);
 802                 has_cache = 0;
 803         } else if (count == SWAP_MAP_SHMEM) {
 804                 /*
 805                  * Or we could insist on shmem.c using a special
 806                  * swap_shmem_free() and free_shmem_swap_and_cache()...
 807                  */
 808                 count = 0;
 809         } else if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) {
 810                 if (count == COUNT_CONTINUED) {
 811                         if (swap_count_continued(p, offset, count))
 812                                 count = SWAP_MAP_MAX | COUNT_CONTINUED;
 813                         else
 814                                 count = SWAP_MAP_MAX;
 815                 } else
 816                         count--;
 817         }
 818
 819         if (!count)
 820                 mem_cgroup_uncharge_swap(entry);
 821
 822         usage = count | has_cache;
 823         p->swap_map[offset] = usage;
 824
 825         /* free if no reference */
 826         if (!usage) {
 827                 dec_cluster_info_page(p, p->cluster_info, offset);
 828                 if (offset < p->lowest_bit)
 829                         p->lowest_bit = offset;
 830                 if (offset > p->highest_bit)
 831                         p->highest_bit = offset;
 832                 set_highest_priority_index(p->type);
 833                 atomic_long_inc(&nr_swap_pages);
 834                 p->inuse_pages--;
 835                 frontswap_invalidate_page(p->type, offset);
 836                 if (p->flags & SWP_BLKDEV) {
 837                         struct gendisk *disk = p->bdev->bd_disk;
 838                         if (disk->fops->swap_slot_free_notify)
 839                                 disk->fops->swap_slot_free_notify(p->bdev,
 840                                                                   offset);
 841                 }
 842         }
 843
 844         return usage;
 845 }

Line 826 shows that once a swap slot's reference count in the swap device/file drops to 0, the slot can be freed. As mentioned earlier, each swap device or swap file has a swap_map, an array of unsigned char with one entry per page of the swap area, recording that page's usage count. For example, if a swap entry is mapped by n PTEs, its count must be n. In the first swap_map the maximum duplication count is SWAP_MAP_MAX (0x3e). When the count needs to go beyond that, the kernel allocates a page to serve as a continuation of swap_map, continues counting at the corresponding offset in that continuation page (each continuation byte can count up to SWAP_CONT_MAX, 0x7f), and marks the original entry with COUNT_CONTINUED (0x80). The corresponding patch is 570a335b8:

+/*
+ * add_swap_count_continuation - called when a swap count is duplicated
+ * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
+ * page of the original vmalloc'ed swap_map, to hold the continuation count
+ * (for that entry and for its neighbouring PAGE_SIZE swap entries).  Called
+ * again when count is duplicated beyond SWAP_MAP_MAX * SWAP_CONT_MAX, etc.
+ *
+ * These continuation pages are seldom referenced: the common paths all work
+ * on the original swap_map, only referring to a continuation page when the
+ * low "digit" of a count is incremented or decremented through SWAP_MAP_MAX.
+ *
+ * add_swap_count_continuation(, GFP_ATOMIC) can be called while holding
+ * page table locks; if it fails, add_swap_count_continuation(, GFP_KERNEL)
+ * can be called after dropping locks.
+ */
+int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
+{
+       struct swap_info_struct *si;
+       struct page *head;
+       struct page *page;
+       struct page *list_page;
+       pgoff_t offset;
+       unsigned char count;
+
+       /*
+        * When debugging, it's easier to use __GFP_ZERO here; but it's better
+        * for latency not to zero a page while GFP_ATOMIC and holding locks.
+        */
+       page = alloc_page(gfp_mask | __GFP_HIGHMEM);
+
+       si = swap_info_get(entry);
+       if (!si) {
+               /*
+                * An acceptable race has occurred since the failing
+                * __swap_duplicate(): the swap entry has been freed,
+                * perhaps even the whole swap_map cleared for swapoff.
+                */
+               goto outer;
+       }
+
+       offset = swp_offset(entry);
+       count = si->swap_map[offset] & ~SWAP_HAS_CACHE;
+
+       if ((count & ~COUNT_CONTINUED) != SWAP_MAP_MAX) {
+               /*
+                * The higher the swap count, the more likely it is that tasks
+                * will race to add swap count continuation: we need to avoid
+                * over-provisioning.
+                */
+               goto out;
+       }
+
+       if (!page) {
+               spin_unlock(&swap_lock);
+               return -ENOMEM;
+       }
+
+       /*
+        * We are fortunate that although vmalloc_to_page uses pte_offset_map,
+        * no architecture is using highmem pages for kernel pagetables: so it
+        * will not corrupt the GFP_ATOMIC caller's atomic pagetable kmaps.
+        */
+       head = vmalloc_to_page(si->swap_map + offset);
+       offset &= ~PAGE_MASK;
+
+       /*
+        * Page allocation does not initialize the page's lru field,
+        * but it does always reset its private field.
+        */
+       if (!page_private(head)) {
+               BUG_ON(count & COUNT_CONTINUED);
+               INIT_LIST_HEAD(&head->lru);
+               set_page_private(head, SWP_CONTINUED);
+               si->flags |= SWP_CONTINUED;
+       }
+
+       list_for_each_entry(list_page, &head->lru, lru) {
+               unsigned char *map;
+
+               /*
+                * If the previous map said no continuation, but we've found
+                * a continuation page, free our allocation and use this one.
+                */
+               if (!(count & COUNT_CONTINUED))
+                       goto out;
+
+               map = kmap_atomic(list_page, KM_USER0) + offset;
+               count = *map;
+               kunmap_atomic(map, KM_USER0);
+
+               /*
+                * If this continuation count now has some space in it,
+                * free our allocation and use this one.
+                */
+               if ((count & ~COUNT_CONTINUED) != SWAP_CONT_MAX)
+                       goto out;
+       }
+
+       list_add_tail(&page->lru, &head->lru);
+       page = NULL;                    /* now it's attached, don't free it */
+out:
+       spin_unlock(&swap_lock);
+outer:
+       if (page)
+               __free_page(page);
+       return 0;
+}
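
The counting scheme can be modelled as a mixed-radix counter: the swap_map byte is the low digit (0..SWAP_MAP_MAX) and each continuation page contributes a higher digit (0..SWAP_CONT_MAX). The sketch below assumes the constants named in the patch above (0x3e and 0x7f), tracks a single offset, and omits the COUNT_CONTINUED/SWAP_HAS_CACHE flag handling, so it illustrates only the idea, not swap_count_continued() itself.

#include <stdio.h>

#define SWAP_MAP_MAX   0x3e     /* max count held in the first swap_map byte */
#define SWAP_CONT_MAX  0x7f     /* max count held in each continuation byte  */
#define BASE0          (SWAP_MAP_MAX + 1)
#define BASEN          (SWAP_CONT_MAX + 1)
#define MAX_LEVELS     4

/* digit[0] models the swap_map byte for one offset; digit[1..] model the
 * bytes at the same offset in successive continuation pages. */
static unsigned char digit[MAX_LEVELS];

static void toy_swap_duplicate(void)        /* add one reference */
{
        if (digit[0] < SWAP_MAP_MAX) {
                digit[0]++;
                return;
        }
        digit[0] = 0;                       /* carry into the continuation chain */
        for (int i = 1; i < MAX_LEVELS; i++) {
                if (digit[i] < SWAP_CONT_MAX) {
                        digit[i]++;
                        return;
                }
                digit[i] = 0;
        }
}

static unsigned long toy_swap_count(void)   /* reconstruct the full count */
{
        unsigned long total = digit[0];
        unsigned long weight = BASE0;

        for (int i = 1; i < MAX_LEVELS; i++) {
                total += digit[i] * weight;
                weight *= BASEN;
        }
        return total;
}

int main(void)
{
        for (int i = 0; i < 200; i++)
                toy_swap_duplicate();
        printf("modelled swap count: %lu\n", toy_swap_count());
        return 0;
}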

References:

1. Understanding the Linux Kernel, 3rd edition.

2. 【无聊科普】固态硬盘(SSD)为什么需要TRIM? (Why do SSDs need TRIM?), http://www.guokr.com/blog/483475/


Contiguous Memory Allocator (CMA) Source Code Analysis


Contents

1. Overview
2. Initialization
3. Allocation
4. Release
5. Summary

 

1. Overview

CMA stands for contiguous memory allocator. It works as follows: a region of memory is reserved for drivers, but while the drivers are not using it, the page allocator (the buddy system) may hand it out to user processes as anonymous memory or page cache. When a driver does need the memory, the pages occupied by processes are reclaimed or migrated away so that the previously reserved region becomes available to the driver again. This article walks through the source code for CMA initialization, allocation and release (based on v3.10).

 

2. Initialization

CMA must be initialized before the buddy physical memory manager is set up and after the memblock early allocator has been activated (see the comment on dma_contiguous_reserve: "This function reserves memory from early allocator. It should be called by arch specific code once the early allocator (memblock or bootmem) has been activated and all other subsystems have already allocated/reserved memory.").

On ARM, the interface that initializes CMA is dma_contiguous_reserve(phys_addr_t limit). The limit parameter is the upper bound of the CMA region.

setup_arch->arm_memblock_init->dma_contiguous_reserve:

107 void __init dma_contiguous_reserve(phys_addr_t limit)
108 {
109         phys_addr_t selected_size = 0;
110
111         pr_debug("%s(limit %08lx)\n", __func__, (unsigned long)limit);
112
113         if (size_cmdline != -1) {
114                 selected_size = size_cmdline;
115         } else {
116 #ifdef CONFIG_CMA_SIZE_SEL_MBYTES
117                 selected_size = size_bytes;
118 #elif defined(CONFIG_CMA_SIZE_SEL_PERCENTAGE)
119                 selected_size = cma_early_percent_memory();
120 #elif defined(CONFIG_CMA_SIZE_SEL_MIN)
121                 selected_size = min(size_bytes, cma_early_percent_memory());
122 #elif defined(CONFIG_CMA_SIZE_SEL_MAX)
123                 selected_size = max(size_bytes, cma_early_percent_memory());
124 #endif
125         }
126
127         if (selected_size) {
128                 pr_debug("%s: reserving %ld MiB for global area\n", __func__,
129                          (unsigned long)selected_size / SZ_1M);
130
131                 dma_declare_contiguous(NULL, selected_size, 0, limit);
132         }
133 };

Two values matter in this function: selected_size and limit. selected_size is the size of the CMA region being declared, and limit is the upper bound below which the region must be placed.

First, how selected_size is determined: if cma=<size> is given on the kernel command line, that value is used (line 114). Otherwise it depends on the Kconfig selection: CONFIG_CMA_SIZE_SEL_MBYTES uses the configured byte size (line 117), CONFIG_CMA_SIZE_SEL_PERCENTAGE uses a percentage of memory (line 119), and if neither of those is selected, the minimum (line 121) or maximum (line 123) of the CONFIG_CMA_SIZE_MBYTES and CONFIG_CMA_SIZE_PERCENTAGE values is taken.
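
As a toy illustration only: the policy boils down to "command line overrides the build-time configuration". The sketch below compresses lines 113-125 into a runtime check; size_cmdline and the configuration values are stand-in variables, and the real kernel chooses between the CONFIG_CMA_SIZE_SEL_* options at compile time rather than at run time.

#include <stdio.h>

#define SZ_1M (1ULL << 20)

/* Stand-ins for size_cmdline and the build-time CONFIG_CMA_SIZE_* choices. */
static long long size_cmdline = -1;                 /* -1: no cma= on the command line */
static unsigned long long config_size_bytes = 16 * SZ_1M;

static unsigned long long select_cma_size(unsigned long long total_mem,
                                          unsigned percent)
{
        if (size_cmdline != -1)                     /* cma=<size> always wins */
                return (unsigned long long)size_cmdline;
        if (percent)                                /* percentage-based configuration */
                return total_mem * percent / 100;
        return config_size_bytes;                   /* fixed-size configuration */
}

int main(void)
{
        unsigned long long selected = select_cma_size(512 * SZ_1M, 0);

        printf("reserving %llu MiB for the global CMA area\n", selected / SZ_1M);
        return 0;
}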

Once the CMA size has been computed and limit is known, dma_declare_contiguous is called.

setup_arch->arm_memblock_init->dma_contiguous_reserve->dma_declare_contiguous:

218 /**
219  * dma_declare_contiguous() - reserve area for contiguous memory handling
220  *                            for particular device
221  * @dev:   Pointer to device structure.
222  * @size:  Size of the reserved memory.
223  * @base:  Start address of the reserved memory (optional, 0 for any).
224  * @limit: End address of the reserved memory (optional, 0 for any).
225  *
226  * This function reserves memory for specified device. It should be
227  * called by board specific code when early allocator (memblock or bootmem)
228  * is still activate.
229  */
230 int __init dma_declare_contiguous(struct device *dev, phys_addr_t size,
231                                   phys_addr_t base, phys_addr_t limit)
232 {
233         struct cma_reserved *r = &cma_reserved[cma_reserved_count];
234         phys_addr_t alignment;
235
…
248
249         /* Sanitise input arguments */
250         alignment = PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order);
251         base = ALIGN(base, alignment);
252         size = ALIGN(size, alignment);
253         limit &= ~(alignment - 1);
254
255         /* Reserve memory */
256         if (base) {
257                 if (memblock_is_region_reserved(base, size) ||
258                     memblock_reserve(base, size) < 0) {
259                         base = -EBUSY;
260                         goto err;
261                 }
262         } else {
263                 /*
264                  * Use __memblock_alloc_base() since
265                  * memblock_alloc_base() panic()s.
266                  */
267                 phys_addr_t addr = __memblock_alloc_base(size, alignment, limit);
268                 if (!addr) {
269                         base = -ENOMEM;
270                         goto err;
271                 } else {
272                         base = addr;
273                 }
274         }
275
276         /*
277          * Each reserved area must be initialised later, when more kernel
278          * subsystems (like slab allocator) are available.
279          */
280         r->start = base;
281         r->size = size;
282         r->dev = dev;
283         cma_reserved_count++;
284         pr_info("CMA: reserved %ld MiB at %08lx\n", (unsigned long)size / SZ_1M,
285                 (unsigned long)base);
286
287         /* Architecture specific contiguous memory fixup. */
288         dma_contiguous_early_fixup(base, size);
289         return 0;
290 err:
291         pr_err("CMA: failed to reserve %ld MiB\n", (unsigned long)size / SZ_1M);
292         return base;
293 }

In this function, the base address and size of the CMA region are derived from the size and limit arguments. If no base was specified (it is 0 in this initialization path), the region has to be allocated with the early allocator. The size must also be aligned; the alignment is normally 4MB (line 250, with MAX_ORDER = 11 and pageblock_order = 10). A CMA region allocated by the early allocator is carved out from the high end of lowmem.
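
For reference, the alignment arithmetic of lines 250-253 can be reproduced in a few lines of user-space C. The constants below assume the 4KB-page ARM configuration discussed in the text (MAX_ORDER 11, pageblock_order 10), and ALIGN_UP is a local macro, not the kernel's ALIGN.

#include <stdio.h>

#define PAGE_SHIFT       12
#define PAGE_SIZE        (1UL << PAGE_SHIFT)
#define MAX_ORDER        11
#define PAGEBLOCK_ORDER  10

#define MAX(a, b)       ((a) > (b) ? (a) : (b))
#define ALIGN_UP(x, a)  (((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
        unsigned long alignment = PAGE_SIZE << MAX(MAX_ORDER - 1, PAGEBLOCK_ORDER);
        unsigned long base  = 0x2345678UL;               /* arbitrary requested base */
        unsigned long size  = 16UL * 1024 * 1024 + 123;  /* ~16MB request            */
        unsigned long limit = 0x30000000UL;

        printf("alignment = %lu MiB\n", alignment >> 20);
        printf("base  -> 0x%lx\n", ALIGN_UP(base, alignment));
        printf("size  -> %lu MiB\n", ALIGN_UP(size, alignment) >> 20);
        printf("limit -> 0x%lx\n", limit & ~(alignment - 1));
        return 0;
}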

Once the base and size of the CMA region are known, they are stored in the global cma_reserved[] array (lines 280-282). The global variable cma_reserved_count tracks how many CMA regions have been recorded in cma_reserved[] (line 283).

In the ARM kernel code, the resulting CMA region is also saved in the dma_mmu_remap array (this structure records only the base and size; see lines 396-410 below).

396 struct dma_contig_early_reserve {
397         phys_addr_t base;
398         unsigned long size;
399 };
400
401 static struct dma_contig_early_reserve dma_mmu_remap[MAX_CMA_AREAS] __initdata;
402
403 static int dma_mmu_remap_num __initdata;
404
405 void __init dma_contiguous_early_fixup(phys_addr_t base, unsigned long size)
406 {
407         dma_mmu_remap[dma_mmu_remap_num].base = base;
408         dma_mmu_remap[dma_mmu_remap_num].size = size;
409         dma_mmu_remap_num++;
410 }

So far the CMA region has only been reserved and recorded in the relevant arrays. After the buddy system has finished initializing, the reserved CMA regions are processed further:

200 static int __init cma_init_reserved_areas(void)
201 {
202         struct cma_reserved *r = cma_reserved;
203         unsigned i = cma_reserved_count;
204
205         pr_debug("%s()\n", __func__);
206
207         for (; i; --i, ++r) {
208                 struct cma *cma;
209                 cma = cma_create_area(PFN_DOWN(r->start),
210                                       r->size >> PAGE_SHIFT);
211                 if (!IS_ERR(cma))
212                         dev_set_cma_area(r->dev, cma);
213         }
214         return 0;
215 }
216 core_initcall(cma_init_reserved_areas);

At line 212, the newly created struct cma is stored in dev->cma_area. When a device later performs a CMA allocation, the CMA buffer is allocated from the range recorded in dev->cma_area.

cma_init_reserved_areas->cma_create_area:

159 static __init struct cma *cma_create_area(unsigned long base_pfn,
160                                      unsigned long count)
161 {
162         int bitmap_size = BITS_TO_LONGS(count) * sizeof(long);
163         struct cma *cma;
164         int ret = -ENOMEM;
165
166         pr_debug("%s(base %08lx, count %lx)\n", __func__, base_pfn, count);
167
168         cma = kmalloc(sizeof *cma, GFP_KERNEL);
169         if (!cma)
170                 return ERR_PTR(-ENOMEM);
171
172         cma->base_pfn = base_pfn;
173         cma->count = count;
174         cma->bitmap = kzalloc(bitmap_size, GFP_KERNEL);
175
176         if (!cma->bitmap)
177                 goto no_mem;
178
179         ret = cma_activate_area(base_pfn, count);
180         if (ret)
181                 goto error;
182
183         pr_debug("%s: returned %p\n", __func__, (void *)cma);
184         return cma;
185
186 error:
187         kfree(cma->bitmap);
188 no_mem:
189         kfree(cma);
190         return ERR_PTR(ret);
191 }

cma_init_reserved_areas->cma_create_area->cma_activate_area:

137 static __init int cma_activate_area(unsigned long base_pfn, unsigned long count)
138 {
139         unsigned long pfn = base_pfn;
140         unsigned i = count >> pageblock_order;
141         struct zone *zone;
142
143         WARN_ON_ONCE(!pfn_valid(pfn));
144         zone = page_zone(pfn_to_page(pfn));
145
146         do {
147                 unsigned j;
148                 base_pfn = pfn;
149                 for (j = pageblock_nr_pages; j; --j, pfn++) {
150                         WARN_ON_ONCE(!pfn_valid(pfn));
151                         if (page_zone(pfn_to_page(pfn)) != zone)
152                                 return -EINVAL;
153                 }
154                 init_cma_reserved_pageblock(pfn_to_page(base_pfn));
155         } while (--i);
156         return 0;
157 }

Earlier, during CMA initialization, we saw that both base and size are aligned to pageblock_order. pageblock_order is 10, i.e. one pageblock represents a 4MB block of memory (2^10 * PAGE_SIZE). cma_activate_area therefore initializes every pageblock used by CMA (lines 140 and 155). Because CMA requires that all pages in a region belong to a single zone, lines 149-153 check that every page is in the same zone, and then each pageblock in the CMA region is initialized.

cma_init_reserved_areas->cma_create_area->cma_activate_area->init_cma_reserved_pageblock:

769 #ifdef CONFIG_CMA
770 /* Free whole pageblock and set it's migration type to MIGRATE_CMA. */
771 void __init init_cma_reserved_pageblock(struct page *page)
772 {
773         unsigned i = pageblock_nr_pages;
774         struct page *p = page;
775
776         do {
777                 __ClearPageReserved(p);
778                 set_page_count(p, 0);
779         } while (++p, --i);
780
781         set_page_refcounted(page);
782         set_pageblock_migratetype(page, MIGRATE_CMA);
783         __free_pages(page, pageblock_order);
784         totalram_pages += pageblock_nr_pages;
785 #ifdef CONFIG_HIGHMEM
786         if (PageHighMem(page))
787                 totalhigh_pages += pageblock_nr_pages;
788 #endif
789 }

Free pages entering the buddy system must have page->_count equal to 0, so line 778 sets the use count of every page in the pageblock to 0. Line 781 then sets _count of the pageblock's first page to 1, because __free_pages at line 783 will drop it by one via put_page_testzero. The pageblock's first page also has its migratetype set to MIGRATE_CMA. Migratetypes are stored in zone->pageblock_flags, 3 bits per entry; the value applies to a whole pageblock, so only the migratetype looked up through the pageblock's first page matters (setting it on the other pages would have no effect).

After the pages have been initialized, __free_pages hands them to the buddy system (line 783).

It follows that at initialization time all CMA pages sit on the order-10 buddy lists, specifically on zone->free_area[10].free_list[MIGRATE_CMA] of the relevant zone.

3. Allocation

CMA is not exposed directly to driver developers. A developer simply calls the usual DMA functions, such as dma_alloc_coherent, when a DMA buffer is needed; eventually the DMA allocation path reaches CMA's allocation function, dma_alloc_from_contiguous.

295 /**
296  * dma_alloc_from_contiguous() - allocate pages from contiguous area
297  * @dev:   Pointer to device for which the allocation is performed.
298  * @count: Requested number of pages.
299  * @align: Requested alignment of pages (in PAGE_SIZE order).
300  *
301  * This function allocates memory buffer for specified device. It uses
302  * device specific contiguous memory area if available or the default
303  * global one. Requires architecture specific get_dev_cma_area() helper
304  * function.
305  */
306 struct page *dma_alloc_from_contiguous(struct device *dev, int count,
307                                        unsigned int align)
308 {
309         unsigned long mask, pfn, pageno, start = 0;
310         struct cma *cma = dev_get_cma_area(dev);
311         struct page *page = NULL;
312         int ret;
313
314         if (!cma || !cma->count)
315                 return NULL;
316
317         if (align > CONFIG_CMA_ALIGNMENT)
318                 align = CONFIG_CMA_ALIGNMENT;
319
320         pr_debug("%s(cma %p, count %d, align %d)\n", __func__, (void *)cma,
321                  count, align);
322
323         if (!count)
324                 return NULL;
325
326         mask = (1 << align) - 1;
327
328         mutex_lock(&cma_mutex);
329
330         for (;;) {
331                 pageno = bitmap_find_next_zero_area(cma->bitmap, cma->count,
332                                                     start, count, mask);
333                 if (pageno >= cma->count)
334                         break;
335
336                 pfn = cma->base_pfn + pageno;
337                 ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA);
338                 if (ret == 0) {
339                         bitmap_set(cma->bitmap, pageno, count);
340                         page = pfn_to_page(pfn);
341                         break;
342                 } else if (ret != -EBUSY) {
343                         break;
344                 }
345                 pr_debug("%s(): memory range at %p is busy, retrying\n",
346                          __func__, pfn_to_page(pfn));
347                 /* try again with a bit different memory target */
348                 start = pageno + mask + 1;
349         }
350
351         mutex_unlock(&cma_mutex);
352         pr_debug("%s(): returned %p\n", __func__, page);
353         return page;
354 }

The comment at lines 301-304 states that the function allocates a buffer from a specific device's CMA area, or from the system default one. Line 310 fetches the device's CMA area; if the device has none, the default global CMA area is used. Each CMA area has a bitmap (cma->bitmap) recording whether each of its pages is already in use, so finding a run of contiguous free pages amounts to finding N consecutive zero bits in cma->bitmap, which stand for N contiguous physical pages. If such a run is found, bitmap_find_next_zero_area returns an index that must not exceed the CMA boundary (checked at line 333), and the corresponding bits in cma->bitmap are set (line 339).
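
Ignoring the migration step (alloc_contig_range and the -EBUSY retry loop), the bitmap bookkeeping alone can be sketched as follows. cma_bitmap, toy_cma_alloc and toy_cma_release are invented stand-ins for cma->bitmap, bitmap_find_next_zero_area() + bitmap_set(), and bitmap_clear().

#include <stdio.h>
#include <stdbool.h>

#define CMA_PAGES 1024          /* toy CMA area: 1024 pages */

static bool cma_bitmap[CMA_PAGES];   /* true = page already handed out */

/* Find 'count' consecutive clear bits, mark them used, return the start
 * index or -1. */
static int toy_cma_alloc(int count)
{
        int run = 0;

        for (int i = 0; i < CMA_PAGES; i++) {
                run = cma_bitmap[i] ? 0 : run + 1;
                if (run == count) {
                        int start = i - count + 1;
                        for (int j = start; j <= i; j++)
                                cma_bitmap[j] = true;
                        return start;
                }
        }
        return -1;
}

static void toy_cma_release(int start, int count)
{
        for (int i = start; i < start + count; i++)
                cma_bitmap[i] = false;
}

int main(void)
{
        int a = toy_cma_alloc(64);      /* e.g. a 256KB DMA buffer */
        int b = toy_cma_alloc(128);

        printf("buffers at page offsets %d and %d\n", a, b);
        toy_cma_release(a, 64);
        return 0;
}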

dma_alloc_from_contiguous->alloc_contig_range:

5908 /**
5909  * alloc_contig_range() -- tries to allocate given range of pages
5910  * @start:      start PFN to allocate
5911  * @end:        one-past-the-last PFN to allocate
5912  * @migratetype:        migratetype of the underlaying pageblocks (either
5913  *                      #MIGRATE_MOVABLE or #MIGRATE_CMA).  All pageblocks
5914  *                      in range must have the same migratetype and it must
5915  *                      be either of the two.
5916  *
5917  * The PFN range does not have to be pageblock or MAX_ORDER_NR_PAGES
5918  * aligned, however it's the caller's responsibility to guarantee that
5919  * we are the only thread that changes migrate type of pageblocks the
5920  * pages fall in.
5921  *
5922  * The PFN range must belong to a single zone.
5923  *
5924  * Returns zero on success or negative error code.  On success all
5925  * pages which PFN is in [start, end) are allocated for the caller and
5926  * need to be freed with free_contig_range().
5927  */
5928 int alloc_contig_range(unsigned long start, unsigned long end,
5929                        unsigned migratetype)
5930 {
5931         unsigned long outer_start, outer_end;
5932         int ret = 0, order;
5933
5934         struct compact_control cc = {
5935                 .nr_migratepages = 0,
5936                 .order = -1,
5937                 .zone = page_zone(pfn_to_page(start)),
5938                 .sync = true,
5939                 .ignore_skip_hint = true,
5940         };
5941         INIT_LIST_HEAD(&cc.migratepages);
5942
5943         /*
5944          * What we do here is we mark all pageblocks in range as
5945          * MIGRATE_ISOLATE.  Because pageblock and max order pages may
5946          * have different sizes, and due to the way page allocator
5947          * work, we align the range to biggest of the two pages so
5948          * that page allocator won't try to merge buddies from
5949          * different pageblocks and change MIGRATE_ISOLATE to some
5950          * other migration type.
5951          *
5952          * Once the pageblocks are marked as MIGRATE_ISOLATE, we
5953          * migrate the pages from an unaligned range (ie. pages that
5954          * we are interested in).  This will put all the pages in
5955          * range back to page allocator as MIGRATE_ISOLATE.
5956          *
5957          * When this is done, we take the pages in range from page
5958          * allocator removing them from the buddy system.  This way
5959          * page allocator will never consider using them.
5960          *
5961          * This lets us mark the pageblocks back as
5962          * MIGRATE_CMA/MIGRATE_MOVABLE so that free pages in the
5963          * aligned range but not in the unaligned, original range are
5964          * put back to page allocator so that buddy can use them.
5965          */
5966
5967         ret = start_isolate_page_range(pfn_max_align_down(start),
5968                                        pfn_max_align_up(end), migratetype,
5969                                        false);
5970         if (ret)
5971                 return ret;
5972
5973         ret = __alloc_contig_migrate_range(&cc, start, end);
5974         if (ret)
5975                 goto done;
5976
5977         /*
5978          * Pages from [start, end) are within a MAX_ORDER_NR_PAGES
5979          * aligned blocks that are marked as MIGRATE_ISOLATE.  What's
5980          * more, all pages in [start, end) are free in page allocator.
5981          * What we are going to do is to allocate all pages from
5982          * [start, end) (that is remove them from page allocator).
5983          *
5984          * The only problem is that pages at the beginning and at the
5985          * end of interesting range may be not aligned with pages that
5986          * page allocator holds, ie. they can be part of higher order
5987          * pages.  Because of this, we reserve the bigger range and
5988          * once this is done free the pages we are not interested in.
5989          *
5990          * We don't have to hold zone->lock here because the pages are
5991          * isolated thus they won't get removed from buddy.
5992          */
5993
5994         lru_add_drain_all();
5995         drain_all_pages();
5996
5997         order = 0;
5998         outer_start = start;
5999         while (!PageBuddy(pfn_to_page(outer_start))) {
6000                 if (++order >= MAX_ORDER) {
6001                         ret = -EBUSY;
6002                         goto done;
6003                 }
6004                 outer_start &= ~0UL << order;
6005         }
6006
6007         /* Make sure the range is really isolated. */
6008         if (test_pages_isolated(outer_start, end, false)) {
6009                 pr_warn("alloc_contig_range test_pages_isolated(%lx, %lx) failed\n",
6010                        outer_start, end);
6011                 ret = -EBUSY;
6012                 goto done;
6013         }
6014
6015
6016         /* Grab isolated pages from freelists. */
6017         outer_end = isolate_freepages_range(&cc, outer_start, end);             
6018         if (!outer_end) {
6019                 ret = -EBUSY;
6020                 goto done;
6021         }
6022
6023         /* Free head and tail (if any) */
6024         if (start != outer_start)
6025                 free_contig_range(outer_start, start - outer_start);
6026         if (end != outer_end)
6027                 free_contig_range(end, outer_end - end);
6028
6029 done:
6030         undo_isolate_page_range(pfn_max_align_down(start),
6031                                 pfn_max_align_up(end), migratetype);
6032         return ret;
6033 }

The function's comment spells out what callers must watch out for: alignment, the requirement that all pages in the range lie in a single zone, and the fact that the pages must later be released with free_contig_range.

At line 5967 the start and end of the range are first aligned, and start_isolate_page_range then checks that the range contains no unmovable pages; an unmovable page cannot be migrated away, so its presence would make the whole range unusable for CMA (start_isolate_page_range->set_migratetype_isolate->has_unmovable_pages). Once it is confirmed that there are no unmovable pages, the pageblocks in the range are marked MIGRATE_ISOLATE, the corresponding free pages in the buddy system are moved onto the free_area[].free_list[MIGRATE_ISOLATE] lists, and drain_all_pages is called (from start_isolate_page_range->set_migratetype_isolate) to flush each CPU's cached free pages back to the buddy system, since some pages in the target range may still sit in the per-CPU pagesets (which hold hot free pages cached per CPU). Then, at line 5973, __alloc_contig_migrate_range picks out the pages in the isolated range that the buddy system had already handed out and migrates their contents elsewhere, vacating the physical pages for CMA. After a contiguous run of physical pages has been vacated, isolate_freepages_range at line 6017 removes that run of free pages from the buddy system.

The code of __alloc_contig_migrate_range is as follows:

dma_alloc_from_contiguous->alloc_contig_range->__alloc_contig_migrate_range:

5862 /* [start, end) must belong to a single zone. */
5863 static int __alloc_contig_migrate_range(struct compact_control *cc,
5864                                         unsigned long start, unsigned long end)
5865 {
5866         /* This function is based on compact_zone() from compaction.c. */
5867         unsigned long nr_reclaimed;
5868         unsigned long pfn = start;
5869         unsigned int tries = 0;
5870         int ret = 0;
5871
5872         migrate_prep();
5873
5874         while (pfn < end || !list_empty(&cc->migratepages)) {
5875                 if (fatal_signal_pending(current)) {
5876                         ret = -EINTR;
5877                         break;
5878                 }
5879
5880                 if (list_empty(&cc->migratepages)) {
5881                         cc->nr_migratepages = 0;
5882                         pfn = isolate_migratepages_range(cc->zone, cc,
5883                                                          pfn, end, true);
5884                         if (!pfn) {
5885                                 ret = -EINTR;
5886                                 break;
5887                         }
5888                         tries = 0;
5889                 } else if (++tries == 5) {
5890                         ret = ret < 0 ? ret : -EBUSY;
5891                         break;
5892                 }
5893
5894                 nr_reclaimed = reclaim_clean_pages_from_list(cc->zone,
5895                                                         &cc->migratepages);
5896                 cc->nr_migratepages -= nr_reclaimed;
5897
5898                 ret = migrate_pages(&cc->migratepages, alloc_migrate_target,
5899                                     0, MIGRATE_SYNC, MR_CMA);
5900         }
5901         if (ret < 0) {
5902                 putback_movable_pages(&cc->migratepages);
5903                 return ret;
5904         }
5905         return 0;
5906 }

This function mainly does the migration work. Since pages in the CMA area are allowed to be handed out by the buddy system as movable pages, some of them may already have been allocated while cma->bitmap still records that they are available for CMA; such pages need to be migrated elsewhere so that they can be vacated for CMA use.

Line 5882 hooks the pages that the buddy system has already handed out onto the cc->migratepages list. Then reclaim_clean_pages_from_list at line 5894 checks whether some of those pages are clean and can be reclaimed directly, and finally migrate_pages at line 5898 migrates the content of the pages that cannot be reclaimed right away to other physical memory.

The function that isolates the pages to be migrated or reclaimed, isolate_migratepages_range, is shown below:

dma_alloc_from_contiguous -> alloc_contig_range -> __alloc_contig_migrate_range -> isolate_migratepages_range:

427 /**
 428  * isolate_migratepages_range() - isolate all migrate-able pages in range.
 429  * @zone:       Zone pages are in.
 430  * @cc:         Compaction control structure.
 431  * @low_pfn:    The first PFN of the range.
 432  * @end_pfn:    The one-past-the-last PFN of the range.
 433  * @unevictable: true if it allows to isolate unevictable pages
 434  *
 435  * Isolate all pages that can be migrated from the range specified by
 436  * [low_pfn, end_pfn).  Returns zero if there is a fatal signal
 437  * pending), otherwise PFN of the first page that was not scanned
 438  * (which may be both less, equal to or more then end_pfn).
 439  *
 440  * Assumes that cc->migratepages is empty and cc->nr_migratepages is
 441  * zero.
 442  *
 443  * Apart from cc->migratepages and cc->nr_migratetypes this function
 444  * does not modify any cc's fields, in particular it does not modify
 445  * (or read for that matter) cc->migrate_pfn.
 446  */
 447 unsigned long
 448 isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 449                 unsigned long low_pfn, unsigned long end_pfn, bool unevictable)
 450 {
 451         unsigned long last_pageblock_nr = 0, pageblock_nr;
 452         unsigned long nr_scanned = 0, nr_isolated = 0;
 453         struct list_head *migratelist = &cc->migratepages;
 454         isolate_mode_t mode = 0;
 455         struct lruvec *lruvec;
 456         unsigned long flags;
 457         bool locked = false;
 458         struct page *page = NULL, *valid_page = NULL;
 459
 460         /*
 461          * Ensure that there are not too many pages isolated from the LRU
 462          * list by either parallel reclaimers or compaction. If there are,
 463          * delay for some time until fewer pages are isolated
 464          */
 465         while (unlikely(too_many_isolated(zone))) {
 466                 /* async migration should just abort */
 467                 if (!cc->sync)
 468                         return 0;
 469
 470                 congestion_wait(BLK_RW_ASYNC, HZ/10);
 471
 472                 if (fatal_signal_pending(current))
 473                         return 0;
 474         }
 475
 476         /* Time to isolate some pages for migration */
 477         cond_resched();
 478         for (; low_pfn < end_pfn; low_pfn++) {
 479                 /* give a chance to irqs before checking need_resched() */
 480                 if (locked && !((low_pfn+1) % SWAP_CLUSTER_MAX)) {
 481                         if (should_release_lock(&zone->lru_lock)) {
 482                                 spin_unlock_irqrestore(&zone->lru_lock, flags);
 483                                 locked = false;
 484                         }
 485                 }
 486
 487                 /*
 488                  * migrate_pfn does not necessarily start aligned to a
 489                  * pageblock. Ensure that pfn_valid is called when moving
 490                  * into a new MAX_ORDER_NR_PAGES range in case of large
 491                  * memory holes within the zone
 492                  */
 493                 if ((low_pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) {
 494                         if (!pfn_valid(low_pfn)) {
 495                                 low_pfn += MAX_ORDER_NR_PAGES - 1;
 496                                 continue;
 497                         }
 498                 }
 499
 500                 if (!pfn_valid_within(low_pfn))
 501                         continue;
 502                 nr_scanned++;
 503
 504                 /*
 505                  * Get the page and ensure the page is within the same zone.
 506                  * See the comment in isolate_freepages about overlapping
 507                  * nodes. It is deliberate that the new zone lock is not taken
 508                  * as memory compaction should not move pages between nodes.
 509                  */
 510                 page = pfn_to_page(low_pfn);
 511                 if (page_zone(page) != zone)
 512                         continue;
 513
 514                 if (!valid_page)
 515                         valid_page = page;
 516
 517                 /* If isolation recently failed, do not retry */
 518                 pageblock_nr = low_pfn >> pageblock_order;
 519                 if (!isolation_suitable(cc, page))
 520                         goto next_pageblock;
 521
 522                 /* Skip if free */
 523                 if (PageBuddy(page))
 524                         continue;
 525
 526                 /*
 527                  * For async migration, also only scan in MOVABLE blocks. Async
 528                  * migration is optimistic to see if the minimum amount of work
 529                  * satisfies the allocation
 530                  */
 531                 if (!cc->sync && last_pageblock_nr != pageblock_nr &&
 532                     !migrate_async_suitable(get_pageblock_migratetype(page))) {
 533                         cc->finished_update_migrate = true;
 534                         goto next_pageblock;
 535                 }
 536
 537                 /*
 538                  * Check may be lockless but that's ok as we recheck later.
 539                  * It's possible to migrate LRU pages and balloon pages
 540                  * Skip any other type of page
 541                  */
 542                 if (!PageLRU(page)) {
 543                         if (unlikely(balloon_page_movable(page))) {
 544                                 if (locked && balloon_page_isolate(page)) {
 545                                         /* Successfully isolated */
 546                                         cc->finished_update_migrate = true;
 547                                         list_add(&page->lru, migratelist);
 548                                         cc->nr_migratepages++;
 549                                         nr_isolated++;
 550                                         goto check_compact_cluster;
 551                                 }
 552                         }
 553                         continue;
 554                 }
 555
 556                 /*
 557                  * PageLRU is set. lru_lock normally excludes isolation
 558                  * splitting and collapsing (collapsing has already happened
 559                  * if PageLRU is set) but the lock is not necessarily taken
 560                  * here and it is wasteful to take it just to check transhuge.
 561                  * Check TransHuge without lock and skip the whole pageblock if
 562                  * it's either a transhuge or hugetlbfs page, as calling
 563                  * compound_order() without preventing THP from splitting the
 564                  * page underneath us may return surprising results.
 565                  */
 566                 if (PageTransHuge(page)) {
 567                         if (!locked)
 568                                 goto next_pageblock;
 569                         low_pfn += (1 << compound_order(page)) - 1;
 570                         continue;
 571                 }
 572
 573                 /* Check if it is ok to still hold the lock */
 574                 locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
 575                                                                 locked, cc);
 576                 if (!locked || fatal_signal_pending(current))
 577                         break;
 578
 579                 /* Recheck PageLRU and PageTransHuge under lock */
 580                 if (!PageLRU(page))
 581                         continue;
 582                 if (PageTransHuge(page)) {
 583                         low_pfn += (1 << compound_order(page)) - 1;
 584                         continue;
 585                 }
 586
 587                 if (!cc->sync)
 588                         mode |= ISOLATE_ASYNC_MIGRATE;
 589
 590                 if (unevictable)
 591                         mode |= ISOLATE_UNEVICTABLE;
 592
 593                 lruvec = mem_cgroup_page_lruvec(page, zone);
 594
 595                 /* Try isolate the page */
 596                 if (__isolate_lru_page(page, mode) != 0)
 597                         continue;
 598
 599                 VM_BUG_ON(PageTransCompound(page));
 600
 601                 /* Successfully isolated */
 602                 cc->finished_update_migrate = true;
 603                 del_page_from_lru_list(page, lruvec, page_lru(page));
 604                 list_add(&page->lru, migratelist);
 605                 cc->nr_migratepages++;
 606                 nr_isolated++;
 607
 608 check_compact_cluster:
 609                 /* Avoid isolating too much */
 610                 if (cc->nr_migratepages == COMPACT_CLUSTER_MAX) {
 611                         ++low_pfn;
 612                         break;
 613                 }
 614
 615                 continue;
 616
 617 next_pageblock:
 618                 low_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages) - 1;
 619                 last_pageblock_nr = pageblock_nr;
 620         }
 621
 622         acct_isolated(zone, locked, cc);
 623
 624         if (locked)
 625                 spin_unlock_irqrestore(&zone->lru_lock, flags);
 626
 627         /* Update the pageblock-skip if the whole pageblock was scanned */
 628         if (low_pfn == end_pfn)
 629                 update_pageblock_skip(cc, valid_page, nr_isolated, true);
 630
 631         trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
 632
 633         count_compact_events(COMPACTMIGRATE_SCANNED, nr_scanned);
 634         if (nr_isolated)
 635                 count_compact_events(COMPACTISOLATED, nr_isolated);
 636
 637         return low_pfn;
 638 }

The function above isolates, within the CMA range being allocated, the pages that are currently in use (PageLRU(page) is true) and hooks them onto the cc->migratepages list so that they can be migrated later. The pages to migrate fall into two classes: pages that can be reclaimed directly (e.g. page cache), and pages that cannot be reclaimed for now, whose content has to be migrated elsewhere. The flow for directly reclaimable pages is as follows:

dma_alloc_from_contiguous -> alloc_contig_range -> __alloc_contig_migrate_range -> reclaim_clean_pages_from_list:

968 unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 969                                             struct list_head *page_list)
 970 {
 971         struct scan_control sc = {
 972                 .gfp_mask = GFP_KERNEL,
 973                 .priority = DEF_PRIORITY,
 974                 .may_unmap = 1,
 975         };
 976         unsigned long ret, dummy1, dummy2;
 977         struct page *page, *next;
 978         LIST_HEAD(clean_pages);
 979
 980         list_for_each_entry_safe(page, next, page_list, lru) {
 981                 if (page_is_file_cache(page) && !PageDirty(page)) {
 982                         ClearPageActive(page);
 983                         list_move(&page->lru, &clean_pages);
 984                 }
 985         }
 986
 987         ret = shrink_page_list(&clean_pages, zone, &sc,
 988                                 TTU_UNMAP|TTU_IGNORE_ACCESS,
 989                                 &dummy1, &dummy2, true);
 990         list_splice(&clean_pages, page_list);
 991         __mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
 992         return ret;
 993 }

In this function, file-cache pages that are clean are reclaimed directly (lines 981-989); if their content is needed later, it can simply be read back from the file. Pages that cannot be reclaimed for some reason are put back on cc->migratepages for migration (line 990).

The page migration flow is as follows:

dma_alloc_from_contiguous -> alloc_contig_range -> __alloc_contig_migrate_range -> migrate_pages:

988 /*
 989  * migrate_pages - migrate the pages specified in a list, to the free pages
 990  *                 supplied as the target for the page migration
 991  *
 992  * @from:               The list of pages to be migrated.
 993  * @get_new_page:       The function used to allocate free pages to be used
 994  *                      as the target of the page migration.
 995  * @private:            Private data to be passed on to get_new_page()
 996  * @mode:               The migration mode that specifies the constraints for
 997  *                      page migration, if any.
 998  * @reason:             The reason for page migration.
 999  *
1000  * The function returns after 10 attempts or if no pages are movable any more
1001  * because the list has become empty or no retryable pages exist any more.
1002  * The caller should call putback_lru_pages() to return pages to the LRU
1003  * or free list only if ret != 0.
1004  *
1005  * Returns the number of pages that were not migrated, or an error code.
1006  */
1007 int migrate_pages(struct list_head *from, new_page_t get_new_page,
1008                 unsigned long private, enum migrate_mode mode, int reason)
1009 {
1010         int retry = 1;
1011         int nr_failed = 0;
1012         int nr_succeeded = 0;
1013         int pass = 0;
1014         struct page *page;
1015         struct page *page2;
1016         int swapwrite = current->flags & PF_SWAPWRITE;
1017         int rc;
1018
1019         if (!swapwrite)
1020                 current->flags |= PF_SWAPWRITE;
1021
1022         for(pass = 0; pass < 10 && retry; pass++) {
1023                 retry = 0;
1024
1025                 list_for_each_entry_safe(page, page2, from, lru) {
1026                         cond_resched();
1027
1028                         rc = unmap_and_move(get_new_page, private,
1029                                                 page, pass > 2, mode);
1030
1031                         switch(rc) {
1032                         case -ENOMEM:
1033                                 goto out;
1034                         case -EAGAIN:
1035                                 retry++;
1036                                 break;
1037                         case MIGRATEPAGE_SUCCESS:
1038                                 nr_succeeded++;
1039                                 break;
1040                         default:
1041                                 /* Permanent failure */
1042                                 nr_failed++;
1043                                 break;
1044                         }
1045                 }
1046         }
1047         rc = nr_failed + retry;
1048 out:
1049         if (nr_succeeded)
1050                 count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
1051         if (nr_failed)
1052                 count_vm_events(PGMIGRATE_FAIL, nr_failed);
1053         trace_mm_migrate_pages(nr_succeeded, nr_failed, mode, reason);
1054
1055         if (!swapwrite)
1056                 current->flags &= ~PF_SWAPWRITE;
1057
1058         return rc;
1059 }

The most important statement in this function is unmap_and_move at line 1028: a new page is allocated through get_new_page, the content is moved over, and the old page is unmapped.

dma_alloc_from_contiguous -> alloc_contig_range -> __alloc_contig_migrate_range -> migrate_pages -> unmap_and_move:

 858 /*
 859  * Obtain the lock on page, remove all ptes and migrate the page
 860  * to the newly allocated page in newpage.
 861  */
 862 static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 863                         struct page *page, int force, enum migrate_mode mode)
 864 {
 865         int rc = 0;
 866         int *result = NULL;
 867         struct page *newpage = get_new_page(page, private, &result);
 868
 869         if (!newpage)
 870                 return -ENOMEM;
 871
 872         if (page_count(page) == 1) {
 873                 /* page was freed from under us. So we are done. */
 874                 goto out;
 875         }
 876
 877         if (unlikely(PageTransHuge(page)))
 878                 if (unlikely(split_huge_page(page)))
 879                         goto out;
 880
 881         rc = __unmap_and_move(page, newpage, force, mode);
 882
 883         if (unlikely(rc == MIGRATEPAGE_BALLOON_SUCCESS)) {
 884                 /*
 885                  * A ballooned page has been migrated already.
 886                  * Now, it's the time to wrap-up counters,
 887                  * handle the page back to Buddy and return.
 888                  */
 889                 dec_zone_page_state(page, NR_ISOLATED_ANON +
 890                                     page_is_file_cache(page));
 891                 balloon_page_free(page);
 892                 return MIGRATEPAGE_SUCCESS;
 893         }
 894 out:
 895         if (rc != -EAGAIN) {
 896                 /*
 897                  * A page that has been migrated has all references
 898                  * removed and will be freed. A page that has not been
 899                  * migrated will have kepts its references and be
 900                  * restored.
 901                  */
 902                 list_del(&page->lru);
 903                 dec_zone_page_state(page, NR_ISOLATED_ANON +
 904                                 page_is_file_cache(page));
 905                 putback_lru_page(page);
 906         }
 907         /*
 908          * Move the new page to the LRU. If migration was not successful
 909          * then this will free the page.
 910          */
 911         putback_lru_page(newpage);
 912         if (result) {
 913                 if (rc)
 914                         *result = rc;
 915                 else
 916                         *result = page_to_nid(newpage);
 917         }
 918         return rc;
 919 }

This function first unmaps the page and then moves it (line 881); the move is effectively a copy (__unmap_and_move -> move_to_new_page -> migrate_page -> migrate_page_copy). Afterwards the old page is released back to the buddy system (line 905).

 

At this point the preparation for CMA to obtain a contiguous range of free physical memory is complete (the contiguous free physical pages have been collected on a list, cc->freepages). These pages, however, still belong to the buddy system and have to be taken off it. This removal does not go through the generic alloc_pages path but is done by hand (dma_alloc_from_contiguous -> alloc_contig_range -> isolate_freepages_range). In the process, the contiguous blocks are split up (order N -> order 0), the page->lru of the head page of each block is removed from the buddy free list, the struct page of every physical page in the contiguous block is set up (split_free_page), and the migrate type of the block in zone->pageblock_flags is set to MIGRATE_CMA.

4.    Freeing

The freeing flow is fairly simple. As with allocation, the CMA freeing interface is exposed directly to DMA, e.g. dma_free_coherent, which eventually reaches the CMA free routine free_contig_range:

6035 void free_contig_range(unsigned long pfn, unsigned nr_pages)
6036 {
6037         unsigned int count = 0;
6038
6039         for (; nr_pages--; pfn++) {
6040                 struct page *page = pfn_to_page(pfn);
6041
6042                 count += page_count(page) != 1;
6043                 __free_page(page);
6044         }
6045         WARN(count != 0, "%d pages are still in use!\n", count);
6046 }

It simply walks over every physical page and releases it back to the buddy system (lines 6039-6044).
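As a usage-level illustration of the allocation and freeing paths discussed in this post, here is a minimal driver-side sketch. It assumes dev is a valid struct device attached to a CMA region, trims all error handling, and whether the buffer really comes from CMA depends on the platform configuration; dma_alloc_coherent/dma_free_coherent are the standard kernel APIs the text mentions, not CMA-specific calls.

#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/errno.h>

static void *buf;
static dma_addr_t buf_dma;

/* Allocate a DMA buffer; on a CMA-backed device this ends up in alloc_contig_range(). */
static int example_alloc(struct device *dev, size_t size)
{
        buf = dma_alloc_coherent(dev, size, &buf_dma, GFP_KERNEL);
        return buf ? 0 : -ENOMEM;
}

/* Release it again; on the CMA path this eventually reaches free_contig_range(). */
static void example_free(struct device *dev, size_t size)
{
        dma_free_coherent(dev, size, buf, buf_dma);
}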

 

5.    Summary

CMA avoids the drawback of a static memory reservation for a specific driver, which would reduce the memory available to the rest of the system. While the driver does not need it, CMA memory can be handed out to user processes; when the driver needs it for DMA transfers, the memory previously given to user processes is reclaimed or migrated to make room for the driver.

 

References:

1. Linux内核最新的连续内存分配器(CMA)——避免预留大块内存, http://blog.csdn.net/21cnbao/article/details/7309757

2. A deep dive into CMA, http://lwn.net/Articles/486301/


Understanding Garbage Collector in ART of Android

Original article; please credit the source when reposting. From: Li Haifeng's Blog
Permalink: Understanding Garbage Collector in ART of Android

Outline
  • Introduction
  • Mark-Sweep Collector
    • Mark Stage
    • Sweep Stage
    • Finish Stage
  • Semi-Space Collector


Introduction
ART currently uses a tracing garbage collection strategy: an object that is no longer referenced is considered garbage and the memory it occupies is released. ART implements two collection techniques, Mark-Sweep and Semi-Space.
Mark-Sweep: the rough idea is to record the location of every object in a bitmap A. Then, starting from all the root objects, scan the references of the roots, then the references of those references, level by level down to the leaf objects, and record every object visited in a second bitmap B. When the scan is finished, compare the two bitmaps: every position set in A but not set in B is garbage; the object is looked up from its index in the bitmap and its memory is released.
Semi-Space: its distinguishing feature is that there are two spaces, one held in reserve while the other is used by the system to allocate objects. During a garbage scan, every object in space A that is still validly referenced is moved into space B; whatever remains in space A is garbage, and the whole space can simply be released. This makes the collection very fast.
Mark-Sweep Collector
To understand Android's garbage collection strategy, one first has to know which Spaces exist in the Heap of each Android VM process. For an ordinary Android VM process, the Heap is inherited from Zygote, so the virtual address range of the Heap is essentially the same for every VM process (only the end of the range differs slightly). The Heap of an ordinary VM process is typically laid out as follows.
1.  Spaces in the Heap
The size of the Image Space depends on the size of boot.art and boot.oat (you can check these two files under /data/dalvik-cache/). Right after the Image Space comes the Zygote Space. As the name suggests, since Zygote is the ancestor of all VM processes, it needs some objects before an ordinary VM process is forked, and these objects are packed compactly into the Zygote Space (why "compactly" becomes clear in the Semi-Space discussion later). Adjacent to the Zygote Space is the Main Space, whose limit is set to 176MB. In fact, the region from the start of the Zygote Space to the end of the Non Moving Space is 512MB in size.
The Mark-Sweep Collector only collects garbage in the Zygote Space, Main Space and Non Moving Space of the Heap. Garbage collection for the Temp Space and the Bump Pointer Space belongs to the Semi-Space Collector.
Mark-Sweep Collectors come in three flavors of increasing weight: Sticky, Partial and Full. They are further divided into Concurrent and Non-Concurrent depending on whether all mutator threads are suspended during GC. So ART defines six Mark-Sweep garbage collectors in total, namely:
333   // Create our garbage collectors.
334   for (size_t i = 0; i < 2; ++i) {
335     const bool concurrent = i != 0;
336     garbage_collectors_.push_back(new collector::MarkSweep(this, concurrent));
337     garbage_collectors_.push_back(new collector::PartialMarkSweep(this, concurrent));
338     garbage_collectors_.push_back(new collector::StickyMarkSweep(this, concurrent));
339   }
Depending on the situation, the system picks a different collector to do the garbage collection work. The one with the least impact on the application's working threads is of course the Concurrent Sticky Mark Sweep Collector; the one with the biggest impact is the Full Mark Sweep Collector; and the most complex one is the Concurrent Mark Sweep Collector. Once the most involved Concurrent Mark Sweep (CMS below) algorithm is understood, the working principles of the other five collectors follow naturally. CMS work can be divided into two big phases: Mark and Sweep.
1 Mark Phase
This phase does the marking: starting from the root objects, it recursively marks all referenced objects. During marking, a referenced object is first marked and then pushed onto a stack; once all of an object's parents have been marked, the objects on the stack are popped one by one, their references are marked, and those references are pushed onto the stack in turn. Objects in different states during marking are conventionally described with colors: black, grey and white. Before marking starts, all objects are white; objects that have been pushed onto and popped off the stack (i.e. fully processed) are black; objects still on the stack are grey. When the Mark phase finishes, every object is either black or white, and the white ones are the garbage that can be reclaimed.
Marking starts from the roots of all objects, and there are many such roots; the simplest example is a reference held by an object on the current function's stack. Because the collector is concurrent, the marking thread runs alongside the mutator threads [1] of the process, which keep allocating new objects and modifying object references while marking is in progress. Therefore there has to be a Stop-The-World stage in which all mutator threads are paused so that the GC can finish marking. Stopping the world for the whole Mark phase would hurt the mutators too much, so ART's approach is to mark twice concurrently and then once more non-concurrently. Marking twice first is meant to get as many objects as possible marked without pausing the mutators. The concrete marking steps and code are listed below:
1.1   Initialize Phase
118 void MarkSweep::InitializePhase() {
121   mark_stack_ = heap_->mark_stack_.get();
123   immune_region_.Reset();
140 }
In this preparatory stage, mark_stack_ and immune_region_ are set up for the marking that follows.
1.1.1  What is mark_stack_ for? mark_stack_ is a LIFO stack. Marking proceeds recursively, and mark_stack_ assists that recursion. In theory recursion needs no auxiliary stack, since the function call stack itself would do, but using mark_stack_ has two advantages: 1. the work can be parallelized: once the number of entries on mark_stack_ reaches a threshold (128), multiple threads can split the stack into segments and mark the entries recursively together; 2. another obvious benefit is that it avoids very deep function call stacks. (A small sketch of this style of marking follows the example below.)
For example, referring to the object graph in the figure above, assume B and C are on the stack:
a). Pop C, mark C's reference F and push F, mark G and push G. Then pop G.
b). G has no other references; pop F; F has no other references; pop B.
c). Mark B's reference D and push D; mark B's reference E and push E; pop E.
d). E has no references; pop D; D has no references. The stack is empty; done.
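The following standalone toy (not ART code) reproduces the push/pop order of steps a)-d) above with an explicit LIFO stack; the object type, its two-reference limit and the names are made up for illustration.

#include <stdio.h>

/* Each toy object holds up to two references and a mark flag; an object is
 * marked when it is pushed, and its references are scanned when it is popped. */
struct obj {
        const char *name;
        int marked;
        struct obj *ref[2];
};

static struct obj *stack[16];
static int top;

static void mark_and_push(struct obj *o)
{
        if (o && !o->marked) {
                o->marked = 1;              /* mark first ...               */
                stack[top++] = o;           /* ... then defer scanning refs */
        }
}

int main(void)
{
        struct obj G = {"G", 0, {0, 0}}, F = {"F", 0, {0, 0}};
        struct obj E = {"E", 0, {0, 0}}, D = {"D", 0, {0, 0}};
        struct obj C = {"C", 0, {&F, &G}}, B = {"B", 0, {&D, &E}};

        mark_and_push(&B);                  /* the roots */
        mark_and_push(&C);

        while (top > 0) {
                struct obj *o = stack[--top];

                printf("pop %s\n", o->name);        /* prints C G F B E D */
                mark_and_push(o->ref[0]);
                mark_and_push(o->ref[1]);
        }
        return 0;
}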
1.1.2  What is immune_region for?
A Heap contains several Spaces with different purposes. The Image Space, for instance, holds the code and data of boot.art and boot.oat; its contents rarely change, so the GC does not scan it completely. Likewise the Zygote Space holds what Zygote had before forking the process and also changes rarely. Depending on how tight memory currently is, different scanning strategies are used, and any space assigned to immune_region is simply not scanned by the GC. But if a reference inside such a space is modified, say an object A in the Image Space used to reference B and is later changed to reference C, then during GC B should become garbage while C must be marked. Handling this requires the extra data structures ModUnionTable and CardTable.
1.1.3  ModUnionTable and CardTable
The ModUnionTable works together with the CardTable. The CardTable can be viewed as an array whose entries, called Cards, occupy one byte each and state whether a contiguous 128-byte chunk of the Heap has been modified: byte 0 (index 0) covers the first 128 bytes of the Heap, byte 1 (index 1) the next 128 bytes, and so on, so the CardTable covers all contiguous Space ranges of the Heap (including the Bump Pointer Space). A Card can take three values: 0, 0x70 and 0x6F. 0x70 (dirty) means some object in the corresponding 128 bytes of the Heap has been modified, 0 means the chunk is clean, and 0x6F is an aged dirty card.
During the GC's marking, every Card belonging to the immune_region whose value is 0x70 is recorded in the ModUnionTable (in a set-like C++ data structure), and the Card's value is decremented by 1, i.e. becomes 0x6F.
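As a concrete illustration of these card mechanics, here is a minimal userspace sketch; the constants follow the text above (one byte per 128 heap bytes, 0x70 for dirty, dirty minus 1 for aged), while the function and variable names are invented.

#include <stdint.h>
#include <stdio.h>

#define CARD_SIZE   128
#define CARD_DIRTY  0x70
#define CARD_CLEAN  0x00

static uint8_t card_table[1024];        /* plenty for the 4KB toy heap below */
static uintptr_t heap_begin;

static uint8_t *card_for(uintptr_t addr)
{
        return &card_table[(addr - heap_begin) / CARD_SIZE];
}

/* Called by the write barrier when a reference field at 'addr' is updated. */
static void mark_card_dirty(uintptr_t addr)
{
        *card_for(addr) = CARD_DIRTY;
}

/* ProcessCards-style aging: dirty drops to 0x6F, anything else is cleaned. */
static void age_card(uint8_t *card)
{
        *card = (*card == CARD_DIRTY) ? CARD_DIRTY - 1 : CARD_CLEAN;
}

int main(void)
{
        uint8_t heap[4096];

        heap_begin = (uintptr_t)heap;
        mark_card_dirty(heap_begin + 300);          /* an object field write */
        age_card(card_for(heap_begin + 300));
        printf("card value after aging: 0x%x\n", *card_for(heap_begin + 300));
        return 0;
}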
1.1.4  RememberedSet and CardTable
The RememberedSet plays the same role as the ModUnionTable; the difference is that the ModUnionTable records modifications of references from the Image Space and Zygote Space to other spaces, while the RememberedSet records modifications of references from the Non Moving Space and Main Space to the Semi Space. Although the RememberedSet only takes effect in the Semi-Space Collector, the ProcessCards step of CMS also backs up the dirty Cards of the Non Moving Space and Main Space into the RememberedSet.
1.1.5  Live_Bitmap and Mark_Bitmap
As mentioned before, the Mark-Sweep collection strategy is a tracing one, and Live_Bitmap and Mark_Bitmap are the bitmaps used for the tracking. Both are bitmaps in which each bit stands for 8 bytes of space. Live_Bitmap records all objects currently existing in the VM process, whether marked or not; Mark_Bitmap, after the Mark phase, records all objects that are currently referenced. The difference between Live_Bitmap and Mark_Bitmap is exactly the set of objects no longer referenced by the system, i.e. the garbage that can be reclaimed.
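A toy version of the address-to-bit mapping and the live/mark difference just described (one bit per 8-byte granule); the sizes and names are assumptions, and the live[i] & ~mark[i] expression mirrors line 156 of the sweep code quoted later.

#include <stdint.h>
#include <stdio.h>

#define ALIGNMENT     8
#define BITS_PER_WORD 64

static uint64_t live[4], mark[4];
static uintptr_t heap_begin;

/* Map an object address to its (word, bit) position and set the bit. */
static void set_bit(uint64_t *bm, uintptr_t addr)
{
        size_t idx = (addr - heap_begin) / ALIGNMENT;

        bm[idx / BITS_PER_WORD] |= 1ull << (idx % BITS_PER_WORD);
}

int main(void)
{
        uint8_t heap[2048];

        heap_begin = (uintptr_t)heap;
        set_bit(live, heap_begin + 0);     /* object A, allocated            */
        set_bit(live, heap_begin + 64);    /* object B, allocated            */
        set_bit(mark, heap_begin + 0);     /* only A was reached by marking  */

        for (int i = 0; i < 4; i++) {
                uint64_t garbage = live[i] & ~mark[i];   /* same diff as line 156 */

                if (garbage)
                        printf("word %d has garbage bits: %llx\n", i,
                               (unsigned long long)garbage);
        }
        return 0;
}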
1.1.6  allocation_stack and live_stack
Every newly allocated object is recorded in allocation_stack, and during marking it is then marked live in Live_Bitmap. allocation_stack and live_stack are really a working stack and a backup stack: when GC needs to process allocation_stack, the two stacks are swapped, and objects newly allocated by mutator threads are pushed onto the backup stack (which then serves as the new working stack).
1.1.7  Concurrent
When does the system collect concurrently and when not? Six garbage collectors are defined, so which one is chosen for an actual collection? At initialization the system sets a collector_type_, which can be kCollectorTypeSS / kCollectorTypeGSS / kCollectorTypeMS / kCollectorTypeCMS. The first three are non-concurrent; only the last one is concurrent, and it is the current default. So when collecting, a concurrent mark-sweep collector is chosen according to whether the collection is sticky, partial or full.
1.2 Marking Phase
1.2.1 In the Marking phase, a sticky or partial CMS adds both the Image Space and the Zygote Space to immune_region. In addition, a sticky CMS binds the mark_bitmap of the Main Space and the Non Moving Space to their live_bitmap (binding means mark_bitmap and live_bitmap point to the same bitmap, so that marking the live_bitmap marks the mark_bitmap at the same time and no garbage is collected from those two spaces). A sticky CMS is only meant to collect garbage among newly allocated objects, not in the other spaces, hence the binding. A partial CMS puts the Zygote Space into immune_region because it does not scan the Zygote Space completely, only the parts whose card table entries were modified. A full CMS adds only the Image Space to immune_region. Note that every kind of CMS puts the Image Space into immune_region and never scans it completely: the Image Space holds boot.art and boot.oat, which a VM process is unlikely to modify, and fully scanning this 60+ MB space would not be worth the cost.
1.2.2  Process the CardTable. The card table is scanned; every Card belonging to the Image Space or Zygote Space whose value is 0x70 (kCardDirty) is inserted into the CardCache of the ModUnionTable and then aged in the CardTable (aging means a value of 0x70 is decremented by 1, and a value that is neither 0x70 nor 0 is set to 0). Modified cards belonging to the Main Space or Non Moving Space are stored in the RememberedSet. Modified Cards of any other space that is not the Semi Space are only aged (though there do not seem to be any other spaces left).
2227 void Heap::ProcessCards(TimingLogger& timings, bool use_rem_sets) {
2228   // Clear cards and keep track of cards cleared in the mod-union table.
2229   for (const auto& space : continuous_spaces_) {
2230     accounting::ModUnionTable* table = FindModUnionTableFromSpace(space);
2231     accounting::RememberedSet* rem_set = FindRememberedSetFromSpace(space);
2232     if (table != nullptr) {
2233       const char* name = space->IsZygoteSpace() ? "ZygoteModUnionClearCards" :
2234           "ImageModUnionClearCards";
2235       TimingLogger::ScopedSplit split(name, &timings);
2236       table->ClearCards();
2237     } else if (use_rem_sets && rem_set != nullptr) {
2238       DCHECK(collector::SemiSpace::kUseRememberedSet && collector_type_ == kCollectorTypeGSS)
2239           << static_cast<int>(collector_type_);
2240       TimingLogger::ScopedSplit split("AllocSpaceRemSetClearCards", &timings);
2241       rem_set->ClearCards();
2242     } else if (space->GetType() != space::kSpaceTypeBumpPointerSpace) {
2243       TimingLogger::ScopedSplit split("AllocSpaceClearCards", &timings);
2244       // No mod union table for the AllocSpace. Age the cards so that the GC knows that these cards
2245       // were dirty before the GC started.
2246       // TODO: Need to use atomic for the case where aged(cleaning thread) -> dirty(other thread)
2247       // -> clean(cleaning thread).
2248       // The races are we either end up with: Aged card, unaged card. Since we have the checkpoint
2249       // roots and then we scan / update mod union tables after. We will always scan either card.
2250       // If we end up with the non aged card, we scan it it in the pause.
2251       card_table_->ModifyCardsAtomic(space->Begin(), space->End(), AgeCardVisitor(), VoidFunctor());
2252     }
2253   }
2254 }
In the current Heap, lines 2230-2242 already cover all the Spaces, so line 2251 looks redundant. Line 2232 handles the Image Space and Zygote Space, and line 2237 handles the Non Moving Space and Main Space. The card table does not cover the Large Object Space.
1.2.3  MarkRoots. Marking starts from each root (each root can be seen as the root of a tree of referenced objects; all objects in ART then form a forest, and any object not hanging off a root tree is an object to be reclaimed).
1.2.4  MarkReachableObjects. First the dirty Cards recorded in the ModUnionTable of the immune_region are updated and marked, and the corresponding objects are pushed onto mark_stack. Then RecursiveMark performs the recursive marking.
At this point every object on a reference chain from the roots has been marked, and it looks as if sweeping could start. But remember that the marking so far was concurrent: the mutator threads kept running, allocating new objects and modifying references of objects that were already marked. For example, after A->B had been marked (both A and B marked), a mutator may allocate a new object C and change A's reference to it, i.e. A->C. If C is not marked and is treated as garbage, the next access through A's reference could cause a segmentation fault. So another round of marking with all mutators suspended is needed. Since CMS wants to minimize mutator pause time, its strategy is to first mark once more from the top, still concurrently.
1.2.5 PreCleanCards.
(1) Step 1.2.2 is repeated, but this time it operates on the objects newly created, or the references modified, during steps 1.2.2 through 1.2.4.
(2) Marking starts again from each root; any object encountered that is not yet marked in mark_bitmap is marked there.
(3) The card table is scanned; the objects covered by cards that were aged at the beginning of 1.2.5, as well as by cards that became dirty during this process, are marked (i.e. marked in mark_bitmap and pushed onto mark_stack) (lines 233 and 922).
(4) The objects on mark_stack are processed recursively together with their references (line 923).
233     RecursiveMarkDirtyObjects(false, accounting::CardTable::kCardDirty - 1);
921 void MarkSweep::RecursiveMarkDirtyObjects(bool paused, byte minimum_age) {
922   ScanGrayObjects(paused, minimum_age);
923   ProcessMarkStack(paused);
924 }
1.2.6 After the two rounds of concurrent scanning above, most of the marking is done. But since the mutators keep running, they inevitably modify the "map" that has already been marked. A third scan is therefore needed, this time under Stop-The-World.
62   void PausePhase() EXCLUSIVE_LOCKS_REQUIRED(Locks::mutator_lock_);
Besides re-marking from the roots, this scan also has to process the Dirty Cards in the card table.
185     RecursiveMarkDirtyObjects(true, accounting::CardTable::kCardDirty);
1.2.7 After the three scans and markings above, the mark "map" is complete. Note, however, that Sweep works against live_bitmap: anything marked in live_bitmap but not in mark_bitmap is treated as garbage; in other words, the objects marked in mark_bitmap are a subset of those marked in live_bitmap. At this point live_bitmap is still incomplete, because the objects in allocation_stack have not yet been marked live in live_bitmap. So allocation_stack is set aside so that mutators no longer use it, and the backup live_stack is put into service.
2185 void Heap::SwapStacks(Thread* self) {
2186   if (kUseThreadLocalAllocationStack) {
2187     live_stack_->AssertAllZero();
2188   }
2189   allocation_stack_.swap(live_stack_);
2190 }
2 Sweep Phase
The Sweep phase reclaims the memory occupied by the white objects. The code fragment below basically tells the whole story.
155   for (size_t i = start; i <= end; i++) {
156     word garbage = live[i] & ~mark[i];
157     if (UNLIKELY(garbage != 0)) {
158       uintptr_t ptr_base = IndexToOffset(i) + live_bitmap.heap_begin_;
159       do {
160         const size_t shift = CLZ(garbage);
161         garbage ^= static_cast<size_t>(kWordHighBitMask) >> shift;
162         *pb++ = reinterpret_cast<mirror::Object*>(ptr_base + shift * kAlignment);
163       } while (garbage != 0);
164       // Make sure that there are always enough slots available for an
165       // entire word of one bits.
166       if (pb >= &pointer_buf[buffer_size - kBitsPerWord]) {
167         (*callback)(pb - &pointer_buf[0], &pointer_buf[0], arg);
168         pb = &pointer_buf[0];
169       }
170     }
171   }
Line 156 treats objects whose bits are set in live_bitmap but not in mark_bitmap as garbage; lines 157-170 release the space the garbage occupies.
3 Finish Phase
As mentioned in 1.2.1, in some cases mark_bitmap is bound to live_bitmap so that a given space is not collected; in this phase the bitmaps have to be unbound.
1541 void Heap::UnBindBitmaps() {
1542   for (const auto& space : GetContinuousSpaces()) {
1543     if (space->IsContinuousMemMapAllocSpace()) {
1544       space::ContinuousMemMapAllocSpace* alloc_space = space->AsContinuousMemMapAllocSpace();
1545       if (alloc_space->HasBoundBitmaps()) {
1546         alloc_space->UnBindBitmaps();
1547       }
1548     }
1549   }
1550 }
Also, as said earlier, mark_bitmap is a subset of live_bitmap and contains all surviving, non-garbage objects. So the pointers of mark_bitmap and live_bitmap are swapped, making mark_bitmap serve as the live_bitmap of the next GC, and the new mark_bitmap is reset.
300     SwapBitmaps();
2789 void Heap::ClearMarkedObjects() {
2790   // Clear all of the spaces' mark bitmaps.
2791   for (const auto& space : GetContinuousSpaces()) {
2792     accounting::ContinuousSpaceBitmap* mark_bitmap = space->GetMarkBitmap();
2793     if (space->GetLiveBitmap() != mark_bitmap) {
2794       mark_bitmap->Clear();
2795     }
2796   }
2797   // Clear the marked objects in the discontinous space object sets.
2798   for (const auto& space : GetDiscontinuousSpaces()) {
2799     space->GetMarkBitmap()->Clear();
2800   }
2801 }
Since mark_stack exists only for marking, it also has to be cleared in the Finish phase.
1292   mark_stack_->Reset();
Semi-Space Collector
The Semi-Space Collector follows the same steps as Mark Sweep when collecting, except that it collects the Heap non-concurrently, which makes things look quite a bit simpler.
Before marking, the Semi-Space Collector (SS below) designates a From_Space, a To_Space and a promoted space. Every object found in From_Space during marking is moved into To_Space. When marking ends, what remains in From_Space is garbage to be cleared, while To_Space holds the surviving objects, packed tightly together, achieving a "compaction" effect. To support generational promotion, at the end of a collection the boundary after the last object in to_space is recorded as last_gc_end; then to_space and from_space swap pointers, and the former to_space becomes the from_space used after the GC, i.e. new Bump Pointer Space objects are allocated in the new from_space, after last_gc_end. At the next Semi-Space collection, objects that survive in From_Space before the last_gc_end position are no longer moved to to_space but to promoted_space, which is normally the main_space. This is SS's notion of generational promotion.
When an object is moved from from_space to to_space or to promo_dest_space, the old object's monitor word tells other referencing objects the new location so they can be updated. For example, objects A and C both reference B (A->B, C->B). If B is moved to a new location B' in to_space while A's reference is being scanned, then A is updated to A->B' and B->monitor is set to 11 | address of B'. Later, when C's references are marked and the top two bits of B's monitor are found to be 11, it is known that B has already moved, and C's reference is simply updated to C->B'.
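The forwarding trick can be sketched as follows; the tag value, word width and field layout here are assumptions for illustration, not ART's actual lock-word encoding.

#include <stdint.h>
#include <stdio.h>

#define FWD_TAG       0x3u      /* the "11" tag in the top two bits */
#define FWD_TAG_SHIFT 30

struct toy_object {
        uint32_t monitor;
        uint32_t payload;
};

static int is_forwarded(const struct toy_object *o)
{
        return (o->monitor >> FWD_TAG_SHIFT) == FWD_TAG;
}

static uint32_t forwarding_address(const struct toy_object *o)
{
        return o->monitor & ~(FWD_TAG << FWD_TAG_SHIFT);
}

int main(void)
{
        struct toy_object old = { 0, 42 };
        uint32_t new_addr = 0x2000;                 /* B' in the example above */

        /* the copier stamps the old object with tag | new address */
        old.monitor = (FWD_TAG << FWD_TAG_SHIFT) | new_addr;
        if (is_forwarded(&old))
                printf("B moved to 0x%x, update C->B accordingly\n",
                       forwarding_address(&old));
        return 0;
}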
As said in the introduction, the Semi-Space Collector decides between a GSS (generational) collection and a whole heap collection. SS/GSS only applies the semi-space scheme to the Bump Pointer Space, while a whole heap collection additionally scans the main space, the non moving space and the Large Object Space (see the code fragment below, especially lines 85-88 and 92-95).
71 void SemiSpace::BindBitmaps() {
72   timings_.StartSplit("BindBitmaps");
73   WriterMutexLock mu(self_, *Locks::heap_bitmap_lock_);
74   // Mark all of the spaces we never collect as immune.
75   for (const auto& space : GetHeap()->GetContinuousSpaces()) {
76     if (space->GetLiveBitmap() != nullptr) {
77       if (space == to_space_) {
78         CHECK(to_space_->IsContinuousMemMapAllocSpace());
79         to_space_->AsContinuousMemMapAllocSpace()->BindLiveToMarkBitmap();
80       } else if (space->GetGcRetentionPolicy() == space::kGcRetentionPolicyNeverCollect
81                  || space->GetGcRetentionPolicy() == space::kGcRetentionPolicyFullCollect
82                  // Add the main free list space and the non-moving
83                  // space to the immune space if a bump pointer space
84                  // only collection.
85                  || (generational_ && !whole_heap_collection_ &&
86                      (space == GetHeap()->GetNonMovingSpace() ||
87                       space == GetHeap()->GetPrimaryFreeListSpace()))) {
88         CHECK(immune_region_.AddContinuousSpace(space)) << "Failed to add space " << *space;
89       }
90     }
91   }
92   if (generational_ && !whole_heap_collection_) {
93     // We won’t collect the large object space if a bump pointer space only collection.
94     is_large_object_space_immune_ = true;
95   }
96   timings_.EndSplit();
97 }
Lines 85-88 show that if generational_ is true and whole_heap_collection_ is false, the non moving space and the main space are added to immune_region. If whole_heap_collection is true, they are not added, and the large object space is scanned as well (line 94). Besides applying SS to the bump pointer space, a whole heap collection also runs the mark-sweep garbage collector over the non moving space and the main space.
So when is a whole heap collection needed? Initially SemiSpace did one whole heap collection after every five generational semi-space collections. A later optimization triggers a whole heap collection once the number of promoted objects reaches a threshold, or once the Large Object bytes allocated since the last whole heap collection exceed the 16MB threshold (kLargeObjectBytesAllocatedThreshold).
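A toy restatement of that decision; the 16MB constant comes from the text (kLargeObjectBytesAllocatedThreshold), while the promoted-bytes threshold is a made-up placeholder.

#include <stdio.h>
#include <stdint.h>

#define LARGE_OBJECT_BYTES_THRESHOLD (16u * 1024 * 1024)  /* from the text */
#define PROMOTED_BYTES_THRESHOLD     (4u * 1024 * 1024)   /* assumption */

/* Decide whether the next Semi-Space run should be a whole heap collection. */
static int need_whole_heap_collection(uint64_t promoted_since_last_whole,
                                      uint64_t large_object_bytes_since_last_whole)
{
        return promoted_since_last_whole >= PROMOTED_BYTES_THRESHOLD ||
               large_object_bytes_since_last_whole >= LARGE_OBJECT_BYTES_THRESHOLD;
}

int main(void)
{
        printf("%d\n", need_whole_heap_collection(1 << 20, 20u * 1024 * 1024)); /* 1 */
        printf("%d\n", need_whole_heap_collection(1 << 20, 1 << 20));           /* 0 */
        return 0;
}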
Notes:
[1]. http://www.ibm.com/developerworks/library/j-ibmjava2/
References


Understanding ROS Memory Allocator in ART Virtual Machine

Original article; please credit the source when reposting. From: Li Haifeng's Blog
Permalink: Understanding ROS Memory Allocator in ART Virtual Machine

Agenda

1. Basic structure of a Run
2. Main algorithms
a)    Allocate
b)    Free
Main text

The ROS memory allocator is the memory allocator used by the GC of the Android ART virtual machine. ROS stands for runs-of-slots. The ROS allocator hands out ranges of virtual addresses (the physical memory behind them is allocated later by the kernel's page-fault handling). Its basic allocation unit is the slot; slot sizes go from 16 bytes to 2048 bytes: 16, 32, 48, ..., n*16, 512, 1024, 2048. Each slot size corresponds to a kind of Run; in other words, one Run provides slots of a single size. There are 34 kinds of Run in total.

I. Basic structure of a Run
A Run consists of two parts, the Run header and the Data Area, as shown below:
The Run header contains:
            Magic number: for debugging.
            Size_bracket_idx_: tells how large the slots of this Run are. For example, if Size_bracket_idx_ is 10, the slot size of the Run is (10+1)*16 = 176 bytes.
            Is_thread_local: tells whether the Run is a thread-local area. The ROS allocator gives each thread a group of thread-local Runs whose slot size is at most 176 bytes; that is, if a thread needs a space of no more than 176 bytes, the virtual memory comes from the thread's own Runs rather than from the ART runtime.
            to_be_bulk_freed_: a flag assisting the case where many objects are freed at once. When set, it means several slots of this Run are being freed. If the Run is thread-local, the bits set in bulk_free_bit_map_ are folded into thread_local_free_bit_map_, and the information in thread_local_free_bit_map_ is merged into alloc_bit_map_ when an attempt to allocate a thread-local slot fails. If the Run is not thread-local, the bits are merged into alloc_bit_map_ at bulk-free time. alloc_bit_map_ is what allocation uses to tell which slots have already been handed out.
            top_slot_idx_: slots inside a Run's Data Area are handed out in increasing order, and this field records how far the allocation has progressed; allocating a new slot just means advancing this field by one. However, slots that were handed out may have been freed again, so when top_slot_idx_ reaches the end, alloc_bit_map_ is consulted to see whether the Run still has free slots, and if so one of them is handed out.
            alloc_bit_map_[]: one bit per slot of the Run; 1 means the slot has been handed out, 0 means it is still free.
            bulk_free_bit_map_[]: one bit per slot; a freed slot sets its bit to 1. At an appropriate time this is folded into thread_local_free_bit_map_[] and later into alloc_bit_map_[], or folded directly into alloc_bit_map_[]; the former applies when the Run is thread-local, the latter when it is not. The reason for having bulk_free_bit_map_[] at all is performance: it reduces lock contention.
            thread_local_free_bit_map_[]: if the Run is thread-local, the corresponding bit is set to 1 when a slot is freed, and at an appropriate time the bits are merged into alloc_bit_map_[] (see the description of to_be_bulk_freed_ above for the exact timing).
            padding: needed for alignment. Each kind of Run occupies a different amount of space (header plus Data Area); the slot size and the number of slots differ, so the three bitmaps (alloc_bit_map_[], bulk_free_bit_map_[], thread_local_free_bit_map_[]) take different amounts of space, and since the Data Area of n*slot_size must end page-aligned, the padding size also differs for each kind of Run.

slot 0, slot 1, ..., last slot make up the real data area: a slot that has not been handed out is free, and one that has been handed out is in use.
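To summarize the layout, here is a simplified, hypothetical C rendering of the fields above; the real rosalloc Run computes its bitmap lengths per bracket, so the fixed-size arrays and the struct itself are only for illustration.

#include <stdint.h>

struct toy_run {
        uint32_t magic_num;                 /* debug aid                              */
        uint8_t  size_bracket_idx;          /* slot size = (idx + 1) * 16 for small brackets */
        uint8_t  is_thread_local;           /* 1 if owned by a single thread          */
        uint8_t  to_be_bulk_freed;          /* bulk-free in progress flag             */
        uint32_t top_slot_idx;              /* next never-used slot                   */
        uint32_t alloc_bit_map[8];          /* 1 = slot handed out                    */
        uint32_t bulk_free_bit_map[8];      /* staged frees, merged later             */
        uint32_t thread_local_free_bit_map[8];
        /* padding would go here so that the data area ends page-aligned */
        uint8_t  data[];                    /* slot 0, slot 1, ..., last slot         */
};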
II. Main algorithms
1.      Allocation
When the system asks the garbage collector to allocate space for an object, it first computes the size the object needs, request_size. If request_size is larger than 2048 bytes, pages are requested directly. Otherwise the Run matching request_size is looked up. The Runs currently being allocated from are organized into arrays of two kinds: one for data blocks larger than 176 bytes, and one, which is thread-local, for blocks of at most 176 bytes. As described at the beginning, each slot size maps to an index: the Run with 16-byte slots is element 0 of the array, and element n is the Run whose slot size is (n+1)*16 (a sketch of this size-to-bracket mapping follows below). Once the Run is found:
(1)     If the Run is empty, it has not yet requested space from the system. Space is not requested immediately, though: first a check is made for a Run of this kind that was fully handed out earlier but has become reusable because some of its slots were freed. If one exists, it is reused; otherwise a number of virtual pages are requested from the system (each kind of Run needs a different number of pages when it is filled, recorded in the numOfPages[] array). Runs whose slots are all used up are kept in full_runs_ (a hash set), while Runs that have usable slots again because some were freed are kept in non_full_runs_ (a set).
(2)     If the Run is not empty, top_slot_idx_ is checked first; if it has not reached the end, a slot is returned and top_slot_idx_ is advanced by one. If top_slot_idx_ has reached the end, the Run is searched for slots that were handed out and freed again, and one of those is returned if found. If all slots of the Run are used up and the Run is thread-local, the information in thread_local_free_bit_map_[] is merged in to see whether a usable slot appears. If no usable slot is found even then, the Run is added to full_runs_, a new Run is set up by requesting pages from the system, and the slot is allocated from that Run. Every time a slot is handed out, the corresponding bit in alloc_bit_map_ is set to 1 to record that the slot is in use.
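The size-to-bracket mapping referenced above can be sketched like this, using the bracket sizes listed at the top of the post (16, 32, ..., 512, 1024, 2048, i.e. 34 brackets); the exact bracket set in a given ART version may differ, and the thread-local routing for requests of at most 176 bytes is omitted.

#include <stdio.h>
#include <stddef.h>

/* Returns -1 when the request is larger than 2048 bytes and pages should be
 * allocated directly; otherwise the bracket index of the matching Run. */
static int size_to_bracket(size_t size)
{
        if (size > 2048)
                return -1;                            /* large object path        */
        if (size <= 512)
                return (int)((size + 15) / 16) - 1;   /* brackets 0..31: 16..512  */
        if (size <= 1024)
                return 32;                            /* 1024-byte bracket        */
        return 33;                                    /* 2048-byte bracket        */
}

static size_t bracket_slot_size(int idx)
{
        return idx <= 31 ? (size_t)(idx + 1) * 16 : (idx == 32 ? 1024 : 2048);
}

int main(void)
{
        size_t sizes[] = { 1, 16, 17, 176, 513, 2048, 4096 };

        for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
                int idx = size_to_bracket(sizes[i]);

                if (idx < 0)
                        printf("%zu bytes -> allocate whole pages\n", sizes[i]);
                else
                        printf("%zu bytes -> bracket %d (slot %zu bytes)\n",
                               sizes[i], idx, bracket_slot_size(idx));
        }
        return 0;
}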
2.     Free
When an object is freed, its address is rounded down to a page address, and that page's purpose is looked up. The virtual pages owned by a ROS instance are contiguous and obtained in advance via mmap, and the instance maintains a page_map_ array with one byte per page. If a page belongs to a Run used for slot allocation, the byte for the first page of the Run is set to the enum constant kPageMapRun and the remaining pages are set to kPageMapRunPart. So when an object is freed, the Run's address in memory can be derived from the object address and page_map_. If the object is a large object (the byte in page_map_ is kPageMapLargeObject), its pages are simply freed.
If the byte for the page in page_map_ is kPageMapRun, the slot's bit in the Run's alloc_bit_map_ is cleared, meaning the slot is free again. If that Run is not a Run the system is currently allocating from (its slots were exhausted and the system has since set up a new Run), it is moved from full_runs_ to non_full_runs_. If the Run becomes entirely free at this point, it is torn down and its pages are returned to the free page set. If the Run is thread-local, however, it is not returned, because a thread-local Run is always one that is currently in use; when the slots of a thread-local Run are exhausted, it is pushed into full_runs_ and loses its thread-local attribute.
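A minimal sketch of the page_map_ walk just described; the enum names follow the text, kPageMapRunPart for the non-first pages of a Run is an assumption, and the sizes are toy values.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096
#define NUM_PAGES 16

enum page_type {
        kPageMapEmpty,
        kPageMapRun,            /* first page of a Run          */
        kPageMapRunPart,        /* remaining pages of that Run  */
        kPageMapLargeObject,
};

static uint8_t page_map[NUM_PAGES];
static uintptr_t base;          /* start of the mmap'ed ROS region */

/* Walk back from the page containing 'addr' to the first page of its Run. */
static uintptr_t run_base_for(uintptr_t addr)
{
        size_t idx = (addr - base) / PAGE_SIZE;

        while (page_map[idx] == kPageMapRunPart)
                idx--;
        return base + idx * PAGE_SIZE;
}

int main(void)
{
        static uint8_t region[NUM_PAGES * PAGE_SIZE];

        base = (uintptr_t)region;
        page_map[3] = kPageMapRun;          /* a Run spanning pages 3..5 */
        page_map[4] = kPageMapRunPart;
        page_map[5] = kPageMapRunPart;

        uintptr_t obj = base + 5 * PAGE_SIZE + 100;   /* some slot in that Run */

        printf("run starts %zu pages into the region\n",
               (size_t)((run_base_for(obj) - base) / PAGE_SIZE));
        return 0;
}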
III. References


Understanding the Three Device Memory Attributes of ARMv8

Original article; please credit the source when reposting. From: Li Haifeng's Blog
Permalink: Understanding the Three Device Memory Attributes of ARMv8

The ARMv8 spec introduces three new concepts for device memory: Gathering, Reordering and Early Write Acknowledgement. Below, these three concepts are explained from a software engineer's point of view.

Gathering

The Gathering attribute determines whether it is permissible for either:
• Multiple memory accesses of the same type, read or write, to the same memory location to be merged into a single transaction.
• Multiple memory accesses of the same type, read or write, to different memory locations to be merged into a single memory transaction on an interconnect.

When a load/store instruction reaches the memory-access stage of the pipeline and misses in the cache (for non-cacheable memory the cache is skipped altogether, and device memory is generally non-cacheable), the memory access has to be sent out on the bus (for example an AXI bus) to reach the device memory; this access across the bus is called a transaction. As a performance optimization, the core contains a buffer: when a memory access request is issued, its information is placed in the buffer, and if a subsequent instruction hits the same cache line or the same address as the previous one, it does not issue another transaction on the bus but shares the result the previous instruction read into the buffer. This sharing is what is called gathering. If the data read into the buffer is cacheable, it is later moved into the cache.

Reordering
Reordering also concerns device memory transactions, that is, successive accesses to the same device memory location or across different device memory locations. With non-reordering, a transaction to device memory may only proceed after the previous device memory transaction has completed (the acknowledgement of the previous load or write has arrived). If the transactions may be issued out of order, without waiting for the previous access's acknowledgement, the behavior is called reordering.

Early Write Acknowledgement

Early Write Acknowledgement is a hint to the platform memory system. Assigning the No Early Write Acknowledgement attribute to a Device memory location recommends that only the endpoint of the write access returns a write acknowledgement of the access, and that no earlier point in the memory system returns a write acknowledge. This means that a DSB barrier, executed by the processor that performed the write to the No Early Write Acknowledgement location, completes only after the write has reached its endpoint in the memory system. Typically, this endpoint is a peripheral or physical memory.

When writing to device memory, an acknowledgement is needed before the next device memory transaction can proceed. As a performance optimization, a device may provide a register-like buffer invisible to the programmer; when the CPU writes data to the device memory over the bus, it is enough to write into that buffer, which can then return an ACK to the CPU indicating the write completed successfully. When the buffer actually moves the data into the device memory is up to the device's internal protocol. The No Early Write Acknowledgement attribute forbids such intermediate acknowledgements: only the endpoint of the write may acknowledge it.
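For reference, the three attributes combine into the four device memory types usually written Device-nGnRnE, Device-nGnRE, Device-nGRE and Device-GRE ("n" meaning the attribute is not allowed). The sketch below shows how such an attribute could be placed into a MAIR_EL1 slot; the byte encodings follow my reading of the ARMv8 ARM (0b0000dd00) and should be verified against the spec before being relied on.

#include <stdio.h>
#include <stdint.h>

#define MAIR_DEVICE_nGnRnE 0x00u   /* no Gathering, no Reordering, no Early ack  */
#define MAIR_DEVICE_nGnRE  0x04u   /* no Gathering, no Reordering, Early ack OK  */
#define MAIR_DEVICE_nGRE   0x08u   /* no Gathering, Reordering and Early ack OK  */
#define MAIR_DEVICE_GRE    0x0Cu   /* Gathering, Reordering and Early ack OK     */

/* Place the chosen attribute byte into one of the eight MAIR_EL1 slots (0..7). */
static uint64_t mair_set_attr(uint64_t mair, unsigned slot, uint8_t attr)
{
        mair &= ~(0xFFull << (slot * 8));
        return mair | ((uint64_t)attr << (slot * 8));
}

int main(void)
{
        uint64_t mair = 0;

        mair = mair_set_attr(mair, 0, MAIR_DEVICE_nGnRE);  /* a typical MMIO choice */
        printf("MAIR_EL1 = 0x%016llx\n", (unsigned long long)mair);
        return 0;
}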

