Understanding Swap in Linux Kernel

Agenda

1. Background

2. Swap device initialization

3. Swap slot allocation

4. The relationship between the swap cache, pages, and swap slots

5. Swap out

6. Swap in

7. Reclaiming swap slots


1. Background

When the system runs short of free physical memory, some memory that is currently in use has to be reclaimed. Reclaim mainly targets the page cache and the anonymous memory of user processes. The page cache caches the contents of files on storage devices, while anonymous memory backs the heap, the stack, and similar data. When free memory is low, page-cache contents can simply be dropped or written back to the storage device to free memory. Anonymous memory, however, has no backing file on disk, so a swap device is needed: the contents of anonymous pages are moved out to it to free memory. A swap device can be a file or a disk partition. This article tries to explain how swap is managed. Since swap is part of memory reclaim, an introduction to the reclaim subsystem can be found in earlier posts: 认识物理内存回收机制 and How to swap out the anonymous page?. In fact, besides anonymous pages, two other kinds of memory are handled by swap: a) dirty pages that belong to a private memory mapping of a process; b) pages that belong to an IPC shared memory region [1].

When a page has been swapped out to a swap device, the page-table entry for its virtual address is rewritten to describe where on the swap device the contents now live. On a 32-bit system this value has 32 bits: within the low 7 bits, the upper 5 (bits 2-6) identify which swap device the page was written to, and bits 7-31 hold the offset inside that device. Normally, if a physical page is 4KB and a disk block is also 4KB, only the top 20 bits (bits 12-31) end up being used, because of 4KB alignment. When the MMU reads such a page-table entry and finds that bit 0 (the present bit) is 0, it raises a fault; the kernel handles the fault, sees that the remaining 31 bits are non-zero, concludes that the missing page lives on swap, and performs a swap-in.
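
To make the encoding concrete, here is a small user-space sketch of it. The shift and width values below simply restate the layout described above and are assumptions for illustration; the real conversion macros live in the arch-specific pgtable headers and differ between architectures and kernel versions.

#include <stdio.h>

/* Assumed layout from the text: bit 0 = present, bits 2-6 = type, bits 7-31 = offset */
#define SWP_TYPE_SHIFT		2
#define SWP_TYPE_BITS		5
#define SWP_OFFSET_SHIFT	7

static unsigned long mk_swp_val(unsigned int type, unsigned long offset)
{
	return ((unsigned long)type << SWP_TYPE_SHIFT) | (offset << SWP_OFFSET_SHIFT);
}

static unsigned int swp_val_type(unsigned long val)
{
	return (val >> SWP_TYPE_SHIFT) & ((1U << SWP_TYPE_BITS) - 1);
}

static unsigned long swp_val_offset(unsigned long val)
{
	return val >> SWP_OFFSET_SHIFT;
}

int main(void)
{
	unsigned long val = mk_swp_val(1, 0x1234);	/* device 1, slot 0x1234 */

	/* bit 0 stays clear, so the MMU would fault; the kernel then decodes: */
	printf("type=%u offset=%#lx\n", swp_val_type(val), swp_val_offset(val));
	return 0;
}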

When space on the swap device becomes tight (after a certain amount of anonymous memory has been swapped out), swap slots have to be reclaimed so that there is room for further swap-outs. Reclaim mainly targets slots whose contents were swapped out earlier, were later swapped back in by some process, but whose swap space was never released.

2. Swap Device Initialization

A swap device can be a disk partition or a file; either is enabled through the swapon system call. For a file, the disk blocks backing it are not necessarily contiguous, so a file-backed swap area is slightly more complex to manage. Each swap device is described by a swap_info_struct, and the system supports up to 32 swap devices by default. Initializing a swap device essentially boils down to initializing its swap_info_struct.
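
As a hedged user-space sketch of how a swap area gets enabled, the snippet below calls swapon(2) with an explicit priority (the value that ends up in swap_info_struct->prio). The path and the priority value are made-up examples; the flag macros and the swapon() wrapper come from <sys/swap.h>, and the call requires CAP_SYS_ADMIN.

#include <stdio.h>
#include <sys/swap.h>

int main(void)
{
	int prio = 5;	/* example priority; ends up in swap_info_struct->prio */
	int flags = SWAP_FLAG_PREFER |
		    ((prio << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK);

	/* /dev/sdb2 is only a placeholder for an already prepared swap partition */
	if (swapon("/dev/sdb2", flags) != 0) {
		perror("swapon");
		return 1;
	}
	return 0;
}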

Since several swap devices can be active at once, which one should be used when a page has to be swapped out? Every swap device is assigned a priority: the larger the number, the higher the priority, and swap-out picks the device with the highest priority.

The data structure that represents a swap device in the kernel is:

178 /*
179  * The in-memory structure used to track swap areas.
180  */
181 struct swap_info_struct {
182         unsigned long   flags;          /* SWP_USED etc: see above */
183         signed short    prio;           /* swap priority of this type */
184         signed char     type;           /* strange name for an index */
185         signed char     next;           /* next type on the swap list */
186         unsigned int    max;            /* extent of the swap_map */
187         unsigned char *swap_map;        /* vmalloc'ed array of usage counts */
188         unsigned int lowest_bit;        /* index of first free in swap_map */
189         unsigned int highest_bit;       /* index of last free in swap_map */
190         unsigned int pages;             /* total of usable pages of swap */
191         unsigned int inuse_pages;       /* number of those currently in use */
192         unsigned int cluster_next;      /* likely index for next allocation */
193         unsigned int cluster_nr;        /* countdown to next cluster search */
194         unsigned int lowest_alloc;      /* while preparing discard cluster */
195         unsigned int highest_alloc;     /* while preparing discard cluster */
196         struct swap_extent *curr_swap_extent;
197         struct swap_extent first_swap_extent;
198         struct block_device *bdev;      /* swap device or bdev of swap file */
199         struct file *swap_file;         /* seldom referenced */
200         unsigned int old_block_size;    /* seldom referenced */
201 #ifdef CONFIG_FRONTSWAP
202         unsigned long *frontswap_map;   /* frontswap in-use, one bit per page */
203         atomic_t frontswap_pages;       /* frontswap pages in-use counter */
204 #endif
205         spinlock_t lock;                /*
206                                          * protect map scan related fields like
207                                          * swap_map, lowest_bit, highest_bit,
208                                          * inuse_pages, cluster_next,
209                                          * cluster_nr, lowest_alloc and
210                                          * highest_alloc. other fields are only
211                                          * changed at swapon/swapoff, so are
212                                          * protected by swap_lock. changing
213                                          * flags need hold this lock and
214                                          * swap_lock. If both locks need hold,
215                                          * hold swap_lock first.
216                                          */
217 };

Each field of this structure is explained by its comment. One field worth highlighting is flags, which describes properties of the swap area. Its values are defined as follows:

146 enum {
147         SWP_USED        = (1 << 0),     /* is slot in swap_info[] used? */
148         SWP_WRITEOK     = (1 << 1),     /* ok to write to this swap?    */
149         SWP_DISCARDABLE = (1 << 2),     /* swapon+blkdev support discard */
150         SWP_DISCARDING  = (1 << 3),     /* now discarding a free cluster */
151         SWP_SOLIDSTATE  = (1 << 4),     /* blkdev seeks are cheap */
152         SWP_CONTINUED   = (1 << 5),     /* swap_map has count continuation */
153         SWP_BLKDEV      = (1 << 6),     /* its a block device */
154         SWP_FILE        = (1 << 7),     /* set after swap_activate success */
155                                         /* add others here before... */
156         SWP_SCANNING    = (1 << 8),     /* refcount in scan_swap_map */
157 };

SWP_USED means the corresponding slot in swap_info[] is in use, i.e., this swap device is available.

SWP_WRITEOK means the swap device may be written to; every swap device is writable once it has been enabled.

SWP_DISCARDABLE indicates whether the device supports discard (see the --discard option in man swapon and reference [2]). This normally only matters for SSDs; if the SSD supports TRIM/discard, performance can improve.

SWP_SOLIDSTATE is set for swap areas that live on SSDs. It is consulted when searching for a free swap page.

For the use of SWP_CONTINUED, see the corresponding description in the swap-reclaim section.

SWP_BLKDEV and SWP_FILE indicate whether the swap area is a block-device partition or a file, respectively.

SWP_SCANNING is set while the system is scanning a swap area for a free page slot, and cleared when the scan finishes.

In addition, swap_map is an array with one byte per swap slot; each byte holds the use count of its slot.
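
That byte is not a plain counter: its top bits are flags. A small decoding sketch, using the two flag values from this kernel generation (a simplification for illustration, not the full set of special values):

#include <stdio.h>

#define SWAP_HAS_CACHE	0x40	/* slot also has a page in the swap cache */
#define COUNT_CONTINUED	0x80	/* count continues on a continuation page */

static void decode_swap_map_byte(unsigned char v)
{
	unsigned char count = v & ~(SWAP_HAS_CACHE | COUNT_CONTINUED);

	printf("pte refs=%u, in swap cache=%d, count continued=%d\n",
	       (unsigned)count, !!(v & SWAP_HAS_CACHE), !!(v & COUNT_CONTINUED));
}

int main(void)
{
	decode_swap_map_byte(0x42);	/* two PTE references plus a swap-cache copy */
	return 0;
}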

3. Swap Slot Allocation

When an anonymous page needs to be swapped out, a free slot must first be allocated on a swap device. Allocation happens in roughly two steps: first pick a swap device, then find a free slot on that device.

To pick a swap device, the kernel maintains a swap_list structure as an aid. swap_list has two members, head and next; whenever a swap device has to be located, the search starts from swap_list.next.
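
For reference, this helper is defined as follows in include/linux/swap.h in the kernel generation this article is based on:

struct swap_list_t {
	int head;	/* head of priority-ordered swapfile list */
	int next;	/* swapfile to be used next */
};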

416 swp_entry_t get_swap_page(void)
417 {
418         struct swap_info_struct *si;
419         pgoff_t offset;
420         int type, next;
421         int wrapped = 0;
422         int hp_index;
423
424         spin_lock(&swap_lock);
425         if (atomic_long_read(&nr_swap_pages) <= 0)
426                 goto noswap;
427         atomic_long_dec(&nr_swap_pages);//Available Pages in swap device.
428
429         for (type = swap_list.next; type >= 0 && wrapped < 2; type = next) {
430                 hp_index = atomic_xchg(&highest_priority_index, -1);//Store -1, and return previous value.
431                 /*
432                  * highest_priority_index records current highest priority swap
433                  * type which just frees swap entries. If its priority is
434                  * higher than that of swap_list.next swap type, we use it.  It
435                  * isn't protected by swap_lock, so it can be an invalid value
436                  * if the corresponding swap type is swapoff. We double check
437                  * the flags here. It's even possible the swap type is swapoff
438                  * and swapon again and its priority is changed. In such rare
439                  * case, low prority swap type might be used, but eventually
440                  * high priority swap will be used after several rounds of
441                  * swap.
442                  */
443                 if (hp_index != -1 && hp_index != type &&
444                     swap_info[type]->prio < swap_info[hp_index]->prio &&
445                     (swap_info[hp_index]->flags & SWP_WRITEOK)) {
446                         type = hp_index;
447                         swap_list.next = type;
448                 }
449
450                 si = swap_info[type];
451                 next = si->next;
452                 if (next < 0 ||
453                     (!wrapped && si->prio != swap_info[next]->prio)) {
454                         next = swap_list.head;
455                         wrapped++;
456                 }
457
458                 spin_lock(&si->lock);
459                 if (!si->highest_bit) {
460                         spin_unlock(&si->lock);
461                         continue;
462                 }
463                 if (!(si->flags & SWP_WRITEOK)) {
464                         spin_unlock(&si->lock);
465                         continue;
466                 }
467
468                 swap_list.next = next;
469
470                 spin_unlock(&swap_lock);
471                 /* This is called for allocating swap entry for cache */
472                 offset = scan_swap_map(si, SWAP_HAS_CACHE);
473                 spin_unlock(&si->lock);
474                 if (offset)
475                         return swp_entry(type, offset);
476                 spin_lock(&swap_lock);
477                 next = swap_list.next;
478         }
479
480         atomic_long_inc(&nr_swap_pages);
481 noswap:
482         spin_unlock(&swap_lock);
483         return (swp_entry_t) {0};
484 }

The first step is to locate a suitable swap file or swap device. All usable swap areas are enumerated in the swap_info[] array, and the index of the swap area to try first is recorded in swap_list.next; that is why the for loop at line 429 starts from swap_list.next. There is also a global variable, highest_priority_index, which records the highest-priority swap area that recently freed entries. So which should be chosen, the area at swap_list.next or the one at highest_priority_index? Whichever has the higher priority (see the comparison at lines 443-445).

Next, the chosen swap area is validated further: it must still have space to allocate (lines 459-462) and must be writable (lines 463-466). swap_list.next is also updated, so that allocations are spread as evenly as possible across the swap areas.

Once a swap area has been chosen, scan_swap_map() allocates a free slot on it. Its usage parameter is set to SWAP_HAS_CACHE, meaning the physical page corresponding to this swap slot also exists in memory, in the address_space; if the slot later loses its address_space reference, the flag is cleared.

get_swap_page-> scan_swap_map:

187 static unsigned long scan_swap_map(struct swap_info_struct *si,
188                                    unsigned char usage)
189 {
190         unsigned long offset;
191         unsigned long scan_base;
192         unsigned long last_in_cluster = 0;
193         int latency_ration = LATENCY_LIMIT;
194         int found_free_cluster = 0;
195
196         /*
197          * We try to cluster swap pages by allocating them sequentially
198          * in swap.  Once we've allocated SWAPFILE_CLUSTER pages this
199          * way, however, we resort to first-free allocation, starting
200          * a new cluster.  This prevents us from scattering swap pages
201          * all over the entire swap partition, so that we reduce
202          * overall disk seek times between swap pages.  -- sct
203          * But we do now try to find an empty cluster.  -Andrea
204          * And we let swap pages go all over an SSD partition.  Hugh
205          */
206
207         si->flags += SWP_SCANNING;
208         scan_base = offset = si->cluster_next;
209
210         if (unlikely(!si->cluster_nr--)) {
211                 if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
212                         si->cluster_nr = SWAPFILE_CLUSTER - 1;
213                         goto checks;
214                 }
215                 if (si->flags & SWP_DISCARDABLE) {
216                         /*
217                          * Start range check on racing allocations, in case
218                          * they overlap the cluster we eventually decide on
219                          * (we scan without swap_lock to allow preemption).
220                          * It's hardly conceivable that cluster_nr could be
221                          * wrapped during our scan, but don't depend on it.
222                          */
223                         if (si->lowest_alloc)
224                                 goto checks;
225                         si->lowest_alloc = si->max;
226                         si->highest_alloc = 0;
227                 }
228                 spin_unlock(&si->lock);
229
230                 /*
231                  * If seek is expensive, start searching for new cluster from
232                  * start of partition, to minimize the span of allocated swap.
233                  * But if seek is cheap, search from our current position, so
234                  * that swap is allocated from all over the partition: if the
235                  * Flash Translation Layer only remaps within limited zones,
236                  * we don't want to wear out the first zone too quickly.
237                  */
238                 if (!(si->flags & SWP_SOLIDSTATE))
239                         scan_base = offset = si->lowest_bit;
240                 last_in_cluster = offset + SWAPFILE_CLUSTER - 1;
241
242                 /* Locate the first empty (unaligned) cluster */
243                 for (; last_in_cluster <= si->highest_bit; offset++) {
244                         if (si->swap_map[offset])
245                                 last_in_cluster = offset + SWAPFILE_CLUSTER;
246                         else if (offset == last_in_cluster) {
247                                 spin_lock(&si->lock);
248                                 offset -= SWAPFILE_CLUSTER - 1;
249                                 si->cluster_next = offset;
250                                 si->cluster_nr = SWAPFILE_CLUSTER - 1;
251                                 found_free_cluster = 1;
252                                 goto checks;
253                         }
254                         if (unlikely(--latency_ration < 0)) {
255                                 cond_resched();
256                                 latency_ration = LATENCY_LIMIT;
257                         }
258                 }
259
260                 offset = si->lowest_bit;
261                 last_in_cluster = offset + SWAPFILE_CLUSTER - 1;
262
263                 /* Locate the first empty (unaligned) cluster */
264                 for (; last_in_cluster < scan_base; offset++) {
265                         if (si->swap_map[offset])
266                                 last_in_cluster = offset + SWAPFILE_CLUSTER;
267                         else if (offset == last_in_cluster) {
268                                 spin_lock(&si->lock);
269                                 offset -= SWAPFILE_CLUSTER - 1;
270                                 si->cluster_next = offset;
271                                 si->cluster_nr = SWAPFILE_CLUSTER - 1;
272                                 found_free_cluster = 1;
273                                 goto checks;
274                         }
275                         if (unlikely(--latency_ration < 0)) {
276                                 cond_resched();
277                                 latency_ration = LATENCY_LIMIT;
278                         }
279                 }
280
281                 offset = scan_base;
282                 spin_lock(&si->lock);
283                 si->cluster_nr = SWAPFILE_CLUSTER - 1;
284                 si->lowest_alloc = 0;
285         }
286
287 checks:
288         if (!(si->flags & SWP_WRITEOK))
289                 goto no_page;
290         if (!si->highest_bit)
291                 goto no_page;
292         if (offset > si->highest_bit)
293                 scan_base = offset = si->lowest_bit;
294
295         /* reuse swap entry of cache-only swap if not busy. */
296         if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
297                 int swap_was_freed;
298                 spin_unlock(&si->lock);
299                 swap_was_freed = __try_to_reclaim_swap(si, offset);
300                 spin_lock(&si->lock);
301                 /* entry was freed successfully, try to use this again */
302                 if (swap_was_freed)
303                         goto checks;
304                 goto scan; /* check next one */
305         }
306
307         if (si->swap_map[offset])
308                 goto scan;
309
310         if (offset == si->lowest_bit)
311                 si->lowest_bit++;
312         if (offset == si->highest_bit)
313                 si->highest_bit--;
314         si->inuse_pages++;
315         if (si->inuse_pages == si->pages) {
316                 si->lowest_bit = si->max;
317                 si->highest_bit = 0;
318         }
319         si->swap_map[offset] = usage;
320         si->cluster_next = offset + 1;
321         si->flags -= SWP_SCANNING;
322
323         if (si->lowest_alloc) {
324                 /*
325                  * Only set when SWP_DISCARDABLE, and there's a scan
326                  * for a free cluster in progress or just completed.
327                  */
328                 if (found_free_cluster) {
329                         /*
330                          * To optimize wear-levelling, discard the
331                          * old data of the cluster, taking care not to
332                          * discard any of its pages that have already
333                          * been allocated by racing tasks (offset has
334                          * already stepped over any at the beginning).
335                          */
336                         if (offset < si->highest_alloc &&
337                             si->lowest_alloc <= last_in_cluster)
338                                 last_in_cluster = si->lowest_alloc - 1;
339                         si->flags |= SWP_DISCARDING;
340                         spin_unlock(&si->lock);
341
342                         if (offset < last_in_cluster)
343                                 discard_swap_cluster(si, offset,
344                                         last_in_cluster - offset + 1);
345
346                         spin_lock(&si->lock);
347                         si->lowest_alloc = 0;
348                         si->flags &= ~SWP_DISCARDING;
349
350                         smp_mb();       /* wake_up_bit advises this */
351                         wake_up_bit(&si->flags, ilog2(SWP_DISCARDING));
352
353                 } else if (si->flags & SWP_DISCARDING) {
354                         /*
355                          * Delay using pages allocated by racing tasks
356                          * until the whole discard has been issued. We
357                          * could defer that delay until swap_writepage,
358                          * but it's easier to keep this self-contained.
359                          */
360                         spin_unlock(&si->lock);
361                         wait_on_bit(&si->flags, ilog2(SWP_DISCARDING),
362                                 wait_for_discard, TASK_UNINTERRUPTIBLE);
363                         spin_lock(&si->lock);
364                 } else {
365                         /*
366                          * Note pages allocated by racing tasks while
367                          * scan for a free cluster is in progress, so
368                          * that its final discard can exclude them.
369                          */
370                         if (offset < si->lowest_alloc)
371                                 si->lowest_alloc = offset;
372                         if (offset > si->highest_alloc)
373                                 si->highest_alloc = offset;
374                 }
375         }
376         return offset;
377
378 scan:
379         spin_unlock(&si->lock);
380         while (++offset <= si->highest_bit) {
381                 if (!si->swap_map[offset]) {
382                         spin_lock(&si->lock);
383                         goto checks;
384                 }
385                 if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
386                         spin_lock(&si->lock);
387                         goto checks;
388                 }
389                 if (unlikely(--latency_ration < 0)) {
390                         cond_resched();
391                         latency_ration = LATENCY_LIMIT;
392                 }
393         }
394         offset = si->lowest_bit;
395         while (++offset < scan_base) {
396                 if (!si->swap_map[offset]) {
397                         spin_lock(&si->lock);
398                         goto checks;
399                 }
400                 if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
401                         spin_lock(&si->lock);
402                         goto checks;
403                 }
404                 if (unlikely(--latency_ration < 0)) {
405                         cond_resched();
406                         latency_ration = LATENCY_LIMIT;
407                 }
408         }
409         spin_lock(&si->lock);
410
411 no_page:
412         si->flags -= SWP_SCANNING;
413         return 0;
414 }

The search for a free slot happens within a cluster; each cluster contains SWAPFILE_CLUSTER (256) slots. Once the slots of the current cluster have all been allocated, a new empty cluster has to be found, which is what lines 210-285 do. In lines 243-258, if no empty cluster has been found by the time highest_bit is reached, the search restarts from lowest_bit (lines 263-279). If that also fails, the code falls through to the scan label and scans slot by slot from start to end (lines 380-393); when it hits a slot whose only remaining use is the swap cache (SWAP_HAS_CACHE), it tries to reclaim it (line 385).

Lines 243-258 try to find a cluster made up entirely of free slots, so that subsequent allocations can all come from that cluster. Lines 323-376 add an optimization specifically for SSDs that support TRIM: once an empty cluster has been found, its old contents are discarded.

In short, whenever an entry of swap_map is zero, the corresponding slot is free and can be allocated.
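
To make that invariant concrete, here is a deliberately naive, non-kernel sketch of an allocator built on that rule alone; the real scan_swap_map() adds clustering, locking, discard and reclaim handling on top of it:

/* Toy allocator: a slot whose swap_map[] byte is 0 is free. */
static long toy_scan_swap_map(unsigned char *swap_map,
			      unsigned long lowest_bit,
			      unsigned long highest_bit)
{
	unsigned long offset;

	for (offset = lowest_bit; offset <= highest_bit; offset++) {
		if (!swap_map[offset]) {
			swap_map[offset] = 1;	/* record one reference */
			return (long)offset;
		}
	}
	return -1;				/* no free slot */
}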

4. The Relationship Between the Swap Cache, Swap Slots, and Pages

The swap cache is not really a cache; it is better thought of as a bookkeeping mechanism. Why is it needed? Consider this case: a physical page referenced by several processes is swapped out, so the page-table entries of all those processes are rewritten to hold the swap slot's address. Later, one of the processes touches the page: a new physical page is allocated and the contents are copied back in from swap. At that point the page-table entries of the other processes still hold the swap slot's address. When one of them needs the contents of the slot, what should happen? It should simply share the physical page that has already been swapped in, and the swap slot's reference count is decremented by one. But how do those processes learn that the slot's contents already have a copy in physical memory? That is exactly what the swap cache provides.
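
Conceptually, the swap cache is a map from swap-entry value to struct page that is consulted before any real I/O. A rough, non-kernel sketch of the idea (in the kernel the map is a per-area radix tree and the lookup is lookup_swap_cache()):

struct page;	/* opaque here */

/* Fault on a swap PTE: reuse a cached copy if one exists, else read it in. */
struct page *toy_swap_in(unsigned long swp_val,
			 struct page *(*cache_lookup)(unsigned long),
			 struct page *(*read_from_swap_device)(unsigned long))
{
	struct page *page = cache_lookup(swp_val);	/* already swapped in by someone? */

	if (!page)
		page = read_from_swap_device(swp_val);	/* miss: do the disk read */
	return page;
}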

For a more detailed explanation, see Understanding the Linux Kernel, 3rd edition, Section 17.4.6, "The Swap Cache".

5. Swap Out

Swapping out means that, during memory reclaim, anonymous pages that will not be used for a while are released to obtain more free physical memory, with their contents parked on the swap device/file. Swap-out therefore has two steps: first find the anonymous pages that are not currently needed, then write them out.

Finding those anonymous pages relies on an approximate LRU algorithm; see the earlier posts on memory reclaim for details.

To swap an anonymous page out, a piece of free space is first allocated on the swap device/file, then the page is marked dirty and queued for block I/O, and the disk driver writes it out. Finding free space on the swap disk/file was covered in the allocation section above, and writing the page to disk is ordinary block I/O, so neither is analyzed further here. Two points deserve attention. First, if a page is referenced by several processes, how do we guarantee that it is written out only once, and to a single swap entry? Second, on the swap-in side, suppose process A has already read the page from swap back into memory while process B, which shares the page, still has a page-table entry pointing at the swap entry; when B accesses that page, how do we avoid reading from swap again what A has already read in? Both problems are solved with the address_space. Each swap device/file has an address_space containing a radix tree whose keys are swap entries and whose values are pointers to physical pages. For the relationship between swap entries, physical pages, and the address_space, see Section 4 above.

Once a free swap slot has been found, the page can be swapped out; it is first added to the swap cache:

 77 /*
 78  * __add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
 79  * but sets SwapCache flag and private instead of mapping and index.
 80  */
 81 int __add_to_swap_cache(struct page *page, swp_entry_t entry)
 82 {
 83         int error;
 84         struct address_space *address_space;
 85
 86         VM_BUG_ON(!PageLocked(page));
 87         VM_BUG_ON(PageSwapCache(page));
 88         VM_BUG_ON(!PageSwapBacked(page));
 89
 90         page_cache_get(page);
 91         SetPageSwapCache(page);
 92         set_page_private(page, entry.val);
 93
 94         address_space = swap_address_space(entry);
 95         spin_lock_irq(&address_space->tree_lock);
 96         error = radix_tree_insert(&address_space->page_tree,
 97                                         entry.val, page);
 98         if (likely(!error)) {
 99                 address_space->nrpages++;
100                 __inc_zone_page_state(page, NR_FILE_PAGES);
101                 INC_CACHE_INFO(add_total);
102         }
103         spin_unlock_irq(&address_space->tree_lock);
104
105         if (unlikely(error)) {
106                 /*
107                  * Only the context which have set SWAP_HAS_CACHE flag
108                  * would call add_to_swap_cache().
109                  * So add_to_swap_cache() doesn't returns -EEXIST.
110                  */
111                 VM_BUG_ON(error == -EEXIST);
112                 set_page_private(page, 0UL);
113                 ClearPageSwapCache(page);
114                 page_cache_release(page);
115         }
116
117         return error;
118 }

For the page being swapped out, its reference count is first incremented (line 90), then the PG_swapcache flag is set, and page->private is set to the swap entry value (which encodes the offset within the swap disk/file). Finally the page is inserted into the address_space radix tree.

6. Swap In

After one of a process's physical pages has been swapped out to the swap area, accessing it again causes a page fault. If the fault handler finds that the page-table entry's value is non-zero but invalid (not present), it concludes that a swap-in is needed. The handler for this case is do_swap_page().

handle_pte_fault->do_swap_page:

2994 /*
2995  * We enter with non-exclusive mmap_sem (to exclude vma changes,
2996  * but allow concurrent faults), and pte mapped but not yet locked.
2997  * We return with mmap_sem still held, but pte unmapped and unlocked.
2998  */
2999 static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
3000                 unsigned long address, pte_t *page_table, pmd_t *pmd,
3001                 unsigned int flags, pte_t orig_pte)
3002 {
3003         spinlock_t *ptl;
3004         struct page *page, *swapcache;
3005         swp_entry_t entry;
3006         pte_t pte;
3007         int locked;
3008         struct mem_cgroup *ptr;
3009         int exclusive = 0;
3010         int ret = 0;
3011 
3012         if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
3013                 goto out;
3014 
3015         entry = pte_to_swp_entry(orig_pte);
3016         if (unlikely(non_swap_entry(entry))) {
3017                 if (is_migration_entry(entry)) {
3018                         migration_entry_wait(mm, pmd, address);
3019                 } else if (is_hwpoison_entry(entry)) {
3020                         ret = VM_FAULT_HWPOISON;
3021                 } else {
3022                         print_bad_pte(vma, address, orig_pte, NULL);
3023                         ret = VM_FAULT_SIGBUS;
3024                 }
3025                 goto out;
3026         }
3027         delayacct_set_flag(DELAYACCT_PF_SWAPIN);
3028         page = lookup_swap_cache(entry);
3029         if (!page) {
3030                 page = swapin_readahead(entry, 
3031                                         GFP_HIGHUSER_MOVABLE, vma, address);
…...
3117         pte = mk_pte(page, vma->vm_page_prot);
3118         if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
3119                 pte = maybe_mkwrite(pte_mkdirty(pte), vma);
3120                 flags &= ~FAULT_FLAG_WRITE;
3121                 ret |= VM_FAULT_WRITE;
3122                 exclusive = 1;
3123         }
3124         flush_icache_page(vma, page);
3125         if (pte_swp_soft_dirty(orig_pte))
3126                 pte = pte_mksoft_dirty(pte);
3127         set_pte_at(mm, address, page_table, pte);
3128         if (page == swapcache)
3129                 do_page_add_anon_rmap(page, vma, address, exclusive);
3130         else /* ksm created a completely new copy */
3131                 page_add_new_anon_rmap(page, vma, address);
3132         /* It's better to call commit-charge after rmap is established */
3133         mem_cgroup_commit_charge_swapin(page, ptr);
3134 
3135         swap_free(entry);
3136         if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
3137                 try_to_free_swap(page);
3138         unlock_page(page);
3139         if (page != swapcache) {
3140                 /*
3141                  * Hold the lock to avoid the swap entry to be reused
3142                  * until we take the PT lock for the pte_same() check
3143                  * (to avoid false positives from pte_same). For
3144                  * further safety release the lock after the swap_free
3145                  * so that the swap count won't change under a
3146                  * parallel locked swapcache.
3147                  */
3148                 unlock_page(swapcache);
3149                 page_cache_release(swapcache);
3150         }
3151 
3152         if (flags & FAULT_FLAG_WRITE) {
3153                 ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
3154                 if (ret & VM_FAULT_ERROR)
3155                         ret &= VM_FAULT_ERROR;
3156                 goto out;
3157         }
3158 
3159         /* No need to invalidate - it was non-present before */
3160         update_mmu_cache(vma, address, page_table);
3161 unlock:
3162         pte_unmap_unlock(page_table, ptl);
3163 out:
3164         return ret;
3165 out_nomap:
3166         mem_cgroup_cancel_charge_swapin(ptr);
3167         pte_unmap_unlock(page_table, ptl);
3168 out_page:
3169         unlock_page(page);
3170 out_release:
3171         page_cache_release(page);
3172         if (page != swapcache) {
3173                 unlock_page(swapcache);
3174                 page_cache_release(swapcache);
3175         }
3176         return ret;
3177 }

Line 3012 avoids duplicated work: if the PTE has already been handled by another process or another core, there is nothing left to do.

Line 3015 derives the swap slot address (the swap entry) from the PTE.

Lines 3016-3026 handle errors and other special entries (migration and hwpoison entries).

Line 3027 sets task_delay_info.flags in the thread's struct task_struct, announcing "a swap-in is in progress". The matching delayacct_clear_flag() calls (lines 3054 and 3062, in the portion elided above) clear the flag again once the swap-in is done.

Because a physical page may be referenced by several processes, if another process has already copied the page back into physical memory, i.e., the page can be found in the swap area's swap cache (line 3028), there is no need to read it again. If it is not in the swap cache, the storage controller has to be driven to read the corresponding contents of the swap area into memory (line 3030). Once that is done, the page table is updated (lines 3117 and 3127), and the reference count of the corresponding swap slot is adjusted (line 3135).

7. Reclaiming Swap Slots

When the swap disk/file is getting full, some of its space needs to be reclaimed. The criterion for deciding the swap area is full is:

/* Swap 50% full? Release swapcache more aggressively.. */
408 static inline bool vm_swap_full(void)
409 {
410         return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages;
411 }

As the code shows, once 50% of the swap space is in use, the kernel considers that the swap area should give some space back. The question then is: which allocated swap slots can be reclaimed? A slot that satisfies all of the following conditions can be reclaimed (see the sketch after this list):

1. The swap entry is present in the swap cache (PG_swapcache is set), i.e., physical memory also holds a copy of the swap entry's contents.

2. The page is not under writeback (PG_writeback is not set), which means the in-memory copy was swapped in earlier rather than being in the middle of a swap-out.

3. The swap entry's reference count is 0, i.e., no process references the swap entry anymore. This may look like it contradicts the first condition; note, however, that before the swap space is freed, the swap entry has already been removed from the address_space, while PG_swapcache is still set.
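
These three tests correspond to the checks in the kernel's try_to_free_swap(); a simplified sketch of its logic, with locking, hibernation checks and error paths omitted, looks like this:

/* Simplified sketch modeled on try_to_free_swap(); not the full function. */
static int can_free_swap_slot(struct page *page)
{
	if (!PageSwapCache(page))	/* 1. a copy must exist in the swap cache */
		return 0;
	if (PageWriteback(page))	/* 2. must not be in the middle of a swap-out */
		return 0;
	if (page_swapcount(page))	/* 3. no PTE references the slot any more */
		return 0;

	delete_from_swap_cache(page);	/* drop the radix-tree entry and the slot */
	SetPageDirty(page);		/* the only copy left is the one in RAM */
	return 1;
}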

After a swap entry has been removed from the address_space (that is, from the swap cache), swapcache_free() is called:

862 /*
 863  * Called after dropping swapcache to decrease refcnt to swap entries.
 864  */
 865 void swapcache_free(swp_entry_t entry, struct page *page)
 866 {
 867         struct swap_info_struct *p;
 868         unsigned char count;
 869
 870         p = swap_info_get(entry);
 871         if (p) {
 872                 count = swap_entry_free(p, entry, SWAP_HAS_CACHE);
 873                 if (page)
 874                         mem_cgroup_uncharge_swapcache(page, entry, count != 0);
 875                 spin_unlock(&p->lock);
 876         }
 877 }

swapcache_free->swap_entry_free:

The swap entry has already been deleted from the swap cache's radix tree, so the next step is to decrement its count in swap_map[]. Because the entry used to be in the swap cache, the SWAP_HAS_CACHE bit of its swap_map[] count is set; that is why swapcache_free() passes SWAP_HAS_CACHE as the usage argument of swap_entry_free(). As the code below shows, when SWAP_HAS_CACHE is passed in, the SWAP_HAS_CACHE bit of the entry's count in swap_map must indeed be set (lines 800-803).

789 static unsigned char swap_entry_free(struct swap_info_struct *p,
 790                                      swp_entry_t entry, unsigned char usage)
 791 {
 792         unsigned long offset = swp_offset(entry);
 793         unsigned char count;
 794         unsigned char has_cache;
 795
 796         count = p->swap_map[offset];
 797         has_cache = count & SWAP_HAS_CACHE;
 798         count &= ~SWAP_HAS_CACHE;
 799
 800         if (usage == SWAP_HAS_CACHE) {
 801                 VM_BUG_ON(!has_cache);
 802                 has_cache = 0;
 803         } else if (count == SWAP_MAP_SHMEM) {
 804                 /*
 805                  * Or we could insist on shmem.c using a special
 806                  * swap_shmem_free() and free_shmem_swap_and_cache()...
 807                  */
 808                 count = 0;
 809         } else if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) {
 810                 if (count == COUNT_CONTINUED) {
 811                         if (swap_count_continued(p, offset, count))
 812                                 count = SWAP_MAP_MAX | COUNT_CONTINUED;
 813                         else
 814                                 count = SWAP_MAP_MAX;
 815                 } else
 816                         count--;
 817         }
 818
 819         if (!count)
 820                 mem_cgroup_uncharge_swap(entry);
 821
 822         usage = count | has_cache;
 823         p->swap_map[offset] = usage;
 824
 825         /* free if no reference */
 826         if (!usage) {
 827                 dec_cluster_info_page(p, p->cluster_info, offset);
 828                 if (offset < p->lowest_bit)
 829                         p->lowest_bit = offset;
 830                 if (offset > p->highest_bit)
 831                         p->highest_bit = offset;
 832                 set_highest_priority_index(p->type);
 833                 atomic_long_inc(&nr_swap_pages);
 834                 p->inuse_pages--;
 835                 frontswap_invalidate_page(p->type, offset);
 836                 if (p->flags & SWP_BLKDEV) {
 837                         struct gendisk *disk = p->bdev->bd_disk;
 838                         if (disk->fops->swap_slot_free_notify)
 839                                 disk->fops->swap_slot_free_notify(p->bdev,
 840                                                                   offset);
 841                 }
 842         }
 843
 844         return usage;
 845 }

Line 826 shows that when a swap slot's reference count in the swap device/file drops to zero, the slot can be freed. As described earlier, the reference counts live in swap_map: every swap device or swap file has one, an array of unsigned char in which each element corresponds, in order, to one page of the swap area and records that page's use count. If a swap entry is mapped by n PTEs, its count must be n. A single swap_map byte can only hold a limited count (SWAP_MAP_MAX in the first byte, SWAP_CONT_MAX = 0x7f in each continuation byte); when the count has to go beyond that, the kernel allocates a page to serve as a continuation of swap_map, continues counting at the same offset within the continuation page, and sets the COUNT_CONTINUED bit (0x80) in the original byte. The corresponding patch is 570a335b8:

+/*
+ * add_swap_count_continuation - called when a swap count is duplicated
+ * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
+ * page of the original vmalloc'ed swap_map, to hold the continuation count
+ * (for that entry and for its neighbouring PAGE_SIZE swap entries).  Called
+ * again when count is duplicated beyond SWAP_MAP_MAX * SWAP_CONT_MAX, etc.
+ *
+ * These continuation pages are seldom referenced: the common paths all work
+ * on the original swap_map, only referring to a continuation page when the
+ * low "digit" of a count is incremented or decremented through SWAP_MAP_MAX.
+ *
+ * add_swap_count_continuation(, GFP_ATOMIC) can be called while holding
+ * page table locks; if it fails, add_swap_count_continuation(, GFP_KERNEL)
+ * can be called after dropping locks.
+ */
+int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
+{
+       struct swap_info_struct *si;
+       struct page *head;
+       struct page *page;
+       struct page *list_page;
+       pgoff_t offset;
+       unsigned char count;
+
+       /*
+        * When debugging, it's easier to use __GFP_ZERO here; but it's better
+        * for latency not to zero a page while GFP_ATOMIC and holding locks.
+        */
+       page = alloc_page(gfp_mask | __GFP_HIGHMEM);
+
+       si = swap_info_get(entry);
+       if (!si) {
+               /*
+                * An acceptable race has occurred since the failing
+                * __swap_duplicate(): the swap entry has been freed,
+                * perhaps even the whole swap_map cleared for swapoff.
+                */
+               goto outer;
+       }
+
+       offset = swp_offset(entry);
+       count = si->swap_map[offset] & ~SWAP_HAS_CACHE;
+
+       if ((count & ~COUNT_CONTINUED) != SWAP_MAP_MAX) {
+               /*
+                * The higher the swap count, the more likely it is that tasks
+                * will race to add swap count continuation: we need to avoid
+                * over-provisioning.
+                */
+               goto out;
+       }
+
+       if (!page) {
+               spin_unlock(&swap_lock);
+               return -ENOMEM;
+       }
+
+       /*
+        * We are fortunate that although vmalloc_to_page uses pte_offset_map,
+        * no architecture is using highmem pages for kernel pagetables: so it
+        * will not corrupt the GFP_ATOMIC caller's atomic pagetable kmaps.
+        */
+       head = vmalloc_to_page(si->swap_map + offset);
+       offset &= ~PAGE_MASK;
+
+       /*
+        * Page allocation does not initialize the page's lru field,
+        * but it does always reset its private field.
+        */
+       if (!page_private(head)) {
+               BUG_ON(count & COUNT_CONTINUED);
+               INIT_LIST_HEAD(&head->lru);
+               set_page_private(head, SWP_CONTINUED);
+               si->flags |= SWP_CONTINUED;
+       }
+
+       list_for_each_entry(list_page, &head->lru, lru) {
+               unsigned char *map;
+
+               /*
+                * If the previous map said no continuation, but we've found
+                * a continuation page, free our allocation and use this one.
+                */
+               if (!(count & COUNT_CONTINUED))
+                       goto out;
+
+               map = kmap_atomic(list_page, KM_USER0) + offset;
+               count = *map;
+               kunmap_atomic(map, KM_USER0);
+
+               /*
+                * If this continuation count now has some space in it,
+                * free our allocation and use this one.
+                */
+               if ((count & ~COUNT_CONTINUED) != SWAP_CONT_MAX)
+                       goto out;
+       }
+
+       list_add_tail(&page->lru, &head->lru);
+       page = NULL;                    /* now it's attached, don't free it */
+out:
+       spin_unlock(&swap_lock);
+outer:
+       if (page)
+               __free_page(page);
+       return 0;
+}

References:

1. Understanding the Linux Kernel, 3rd edition.

2. 【无聊科普】固态硬盘(SSD)为什么需要TRIM? (Why do SSDs need TRIM?), http://www.guokr.com/blog/483475/
