www.atoptool.nl May 2010

This document describes the analysis of a slow system suffering from a process with a memory leakage. Such a process regularly requests more dynamic memory with the subroutine malloc, while the programmer has "forgotten" to free the requested memory again. In this way the process grows virtually as well as physically. Mainly by the physical growth (the process' resident set size increases), the process inflates like a balloon and pushes other processes out of main memory. Instead of a healthy system where processes reach a proper balance in their memory consumption, the total system performance might degrade just by one process that leaks memory.

Notice that the Linux kernel does not limit the physical memory consumption of processes. Every process, either running under root identity or non-root identity, can grow without limit. In the last section of this case study, some suggestions will be given to decrease the influence of leaking processes on your overall system performance.

In order to be able to interpret the figures produced by atop, basic knowledge of Linux memory management is required. The next sections describe the utilization of physical memory as well as the impact of virtual memory before focusing on the details of the case itself.

Introduction to physical memory

The physical memory (RAM) of your system is subdivided in equally-sized portions, called memory pages. The size of a memory page depends on the CPU architecture and the settings issued by the operating system. Let's assume for this article that the size of a memory page is 4 KiB.

At the moment that the system is booted, the compressed kernel image known as the file /boot/vmlinuz-....
is loaded and decompressed in memory. This static part of the kernel is loaded somewhere at the beginning of the RAM memory.

The running kernel requires however more space, e.g. for the administration of processes, open files, network sockets, ... but also to load dynamic loadable modules. Therefore, the kernel dynamically allocates memory using so-called "slab caches", in short slab. When this dynamically allocated kernel memory is not needed any more (removal of process administration when a process exits, unloading a loaded module, ....), the kernel might free this memory. This means that the slab space will shrink again. Notice that all pages in use by the kernel are memory resident and will never be swapped.

Apart from the kernel, also processes require physical pages for their text (code), static data and stack. The physical space consumed by a process is called "Resident Set Size", in short RSS. How a page becomes part of the RSS will be discussed in the next section.

The remaining part of the physical memory after the kernel and processes have taken their share is mainly used for the page cache. The page cache keeps as much data as possible from the disks (filesystems) in memory in order to improve the access speed to disk data. The page cache consists of two parts: the part where the data blocks of files are stored and the part where the metadata blocks (superblocks, inodes, bitmaps, ...) of filesystems are stored. The latter part is called the "buffer cache". Most tools (like free, top, atop)
show two separate values for these parts, respectively "cached" and "buffer". The sum of these two values is the total size of the page cache. The size of the page cache varies. If there is plenty of free memory, the page cache will grow and if there is a lack of memory, the page cache will shrink again.

Finally, the kernel keeps a pool of free pages to be able to fulfill a request for a new page and deliver it from stock straight away. When the number of free pages drops below a particular threshold, pages that are currently occupied will be freed and added to the free page pool. Such a page can be retrieved from a process (current page contents might have to be swapped to swap space first) or it can be stolen from the page cache (current page contents might have to be flushed to the filesystem first). In the first case, the RSS of the process concerned shrinks. In the second case, the size of the page cache shrinks.

Even the slab might shrink in case of a lack of free pages. The slab also contains data that is just meant to speed up certain mechanisms, but can be shrunk in case of memory pressure. An example is the in-core inode cache that contains inodes of files that are currently open, but also contains inodes of files that have recently been open but are currently closed. That last category is kept in memory, just in case such a file will be opened again in the near future (saves another inode retrieval from disk). If needed however, the in-core inodes of closed files can be removed. Another example is the directory name cache (dentry cache) that holds the names of recently accessed files and directories. The dentry cache is meant to speed up the pathname resolution by avoiding accesses to disk. In case of memory pressure, the least-recently accessed names might be removed to shrink the slab.

In the output of atop, the sizes of the memory components that have just been discussed can be found:

In the line labeled MEM, the size of the RAM memory is shown (tot), the memory that is currently free (free),
the size of the page cache (cache+buff) and the size of the dynamically allocated kernel memory (slab).

The RSS of the processes can be found in the process list (the lower part of screen):

The memory details (subcommand 'm') show the current RSS per process in the column RSIZE and (as a percentage of the total memory installed) in the column MEM. The physical growth of the process during the last interval is shown in the column RGROW.

Introduction to virtual memory

When a new program is activated, the kernel constructs a virtual address space for the new process. This virtual address space describes all memory that the process could possibly use. For a starting process, the size of the virtual space is mainly determined by the text (T) and data (D) pages in the executable file, with a few additional pages for the process' stack. Notice that the process does not consume any physical memory yet during its early startup stage.

The illustration shows an executable file with 32KiB (8 pages) of text and 32KiB (8 pages)
of static data. The kernel has just built a virtual address space for the pages of the executable file and 4 additional pages (e.g. for stack).

After the virtual address space has been built, the kernel fills the program counter register of the CPU with the address of the first instruction to be executed. The CPU tries to fetch this instruction, but notices that the text page concerned is not in memory. Therefore the CPU generates a fault (trap). The fault handling routine of the kernel will load the requested text page from the executable file and restart the process at the same point. Now the CPU is able to fetch the first instruction and execute it. When a branch is made to an instruction that lies in another text page, that page will be loaded via fault handling as well. The kernel will load a data page as soon as a reference is made to a static variable. In this way, any page can be physically loaded into memory at its first reference. Pages that are not referenced at all will not be loaded into memory.

The illustration shows that the process has a virtual size of 80KiB (20 pages) and a physical size (RSS) of 16KiB (4 pages). Obviously, the physical size is always a subset of the virtual size and can never be larger than that.

Suppose that the process allocates memory dynamically (with the subroutine malloc): the requested space will initially only extend the process' virtual address space. Only when the process really refers to a page in the dynamic area is that page physically created and filled with binary zeroes.

The illustration shows that the first malloc'ed area has a virtual size of 48KiB (12 pages) while only 16KiB (4 pages) are physically created by a reference. Also the pages in the second and third malloc'ed space have not all been referenced.

In the process list shown by atop, information can be found about the virtual address space:

The memory details (subcommand 'm')
show the current virtual size per process in the column VSIZE. The virtual growth of a process during the last interval is shown in the column VGROW.

For process simpress.bin with pid 3456 a virtual growth of 1628KiB is shown (probably by issuing a malloc) while the resident growth is 1596KiB. Process chrome with pid 29272 has not grown virtually (0KiB) but has referenced pages (60 pages of 4KiB) during the last interval that have been allocated virtually during earlier intervals. Process Xorg with pid 1530 has only grown virtually (6116KiB) but has not referenced any of the new pages (yet). Process firefox with pid 4680 has freed malloc'ed space with a virtual size of 272KiB. Apparently, this implies the release of 272KiB resident space.

Case study: A quiet system

The first atop snapshot was taken when a program that leaks memory has just been started, called lekker (Dutch for "leaker").

In this snapshot we see a system with net 3.8GiB of physical memory of which 186MiB is free and more than 1GiB in the page cache (MEM line: cache + buff). The kernel has dynamically allocated 273MiB slab. Thus, some 2.3GiB is in use by application processes. Swap space (8GiB) is almost unused (so far...).

We can see that the lekker process has grown 156MiB (virtual) during the last interval which was also made resident by really referencing the allocated space. For now, there is no reason to be worried.

Please take notice of the six upload processes. They have allocated 256MiB each (virtual and resident). We can also see multiple chrome processes that may have a large virtual size, but only a small portion of that is resident: the cumulated virtual size is 4.1GiB (1+1+0.6+1.5 GiB) of which 229MiB (103+52+44+30 MiB) is resident. The process simpress.bin has made less than 10% of its virtual footprint (1.1GiB)
resident (91MiB). Also firefox has a relatively small portion of its virtual footprint resident. A lot of these virtual sizes will be shared, not only for the same executable file (4 chrome processes share at least the same code), but also for shared library code used by all processes.

Till now the system has "promised" 3.4GiB of virtual memory (MEM line, vmcom) of the total limit 9.9GiB (vmlim, which is the size of swap space plus half of the physical memory).

Case study: It's getting a bit busy...

One snapshot later, twenty seconds on, lekker has grown another 150MiB (virtual and resident). For a large part that could be claimed from the free space, but not entirely. The number of free pages in stock is getting very low, so the kernel tries to free memory somewhere else. We can see that the first victims are the page cache and the slab. They both have to shrink. The MEM line is displayed in cyan because the amount of memory that can quickly be claimed is small (free plus the page cache).

The processes are not yet in the danger zone because no pages are swapped out (PAG line, swout). Better yet, firefox has physically referenced another 892KiB (that is a small amount compared to the 156MiB that lekker got).

We can see that chrome has shrunk by 24KiB (6 pages). We later found out that this was caused by malloc'ed memory pages that were freed, followed by a new malloc of 6 pages without referencing the pages again. Hence the virtual and resident size at first shrunk by 6 pages, after which only the virtual size grew by 6 pages. In the end, the resident size shrunk without a change of the virtual size.

Case study: The kernel gets worried....

Four snapshots later (20 seconds each), we see that lekker has an insatiable hunger: it has grown more than 600MiB (virtual and resident) since the previous screen shot, which is about 150MiB per 20 seconds.

The page cache has been shrunk as well as the slab
(e.g. in-core inodes and directory entries). Processes weren't spared either. They didn't shrink virtually, but some of their resident pages were taken away by the kernel (negative RGROW).

Because the worried (but not yet desperate) kernel is looking hard for memory to free, more and more pages are checked by the page scanner: 30712 pages were verified to see if they are candidates to be removed from memory (PAG line, scan). If a page that was modified has to be removed from memory, that page has to be saved to the swap space. In this snapshot, 322 pages were written to the swap disk (PAG line, swout). This resulted in 322 writes to the swap space logical volume (vg00-lvswap) that were combined to about 60 writes to the physical disk (sda).

Because so many pages were swapped out in a short time, the PAG line is displayed in red. However, processes don't sit still and some of the pages that were swapped out will be referenced again. These pages are read again, which happened 91 times (PAG line, swin).

Fortunately atop itself makes all its pages resident at startup and locks them in memory, thus preventing them from being swapped out, which would make the measurements unreliable.

Case study: The kernel gets desperate as well as the users...

The memory-leaking process lekker cannot be stopped. We fast-forward 5 minutes:

By now, lekker has grown to a virtual size of 2.8GiB of which 2GiB is resident (more than 50% of the physical memory of the system). In the past 20 seconds, lekker has tried to get hold of 128MiB more virtual memory but has only been able to make 34MiB resident. We know from the past that lekker tries to make all its virtual memory resident as soon as it can, so we can conclude that the kernel is very busy swapping out pages.

The page cache has already been minimized, as well as the inode cache and directory entry cache (part of the slab). Obviously the processes will also have to "donate" physical memory. One of the upload processes (PID 30926)
is even donating 19MiB. We can see for some of the upload processes that they had to give back quite a lot more physical memory (RSIZE). They had 6 times 256MiB, of which they now have only two thirds.

The system is swapping out heavily (32135 pages in the last 20 seconds) but is also swapping in (8440 pages). Because of this, the PAG line is displayed in red. The disk is very busy (the DSK as well as LVM lines are red). The average service times of requests for the logical volumes that are not related to swapping (lvhome and lvusr) are getting longer because the requests to those areas are swamped by requests to swap space (lvswap). Although a relatively small number of requests are related to lvusr and lvhome, these logical volumes are busy respectively 75% and 92% of the time.

The system feels extremely slow now. Time to get rid of the leaking process.....

Case study: Relief...

Five minutes later, the big spender lekker has finished and is thus not using memory any more:

However, the effect of lekker as a memory hog can be noticed for a long time. We can see that the upload processes are slowly referencing their swapped-out pages resulting in a resident growth again. Because there is an ocean of free space (1.8 GiB), nothing is swapped out any more and there is hardly any scanning (PAG line, scan).

We see a lot of major page faults (MAJFLT) for processes: references to virtual pages that are retrieved from disk. Either they were swapped out and now have to be swapped in, or they are read from the executable file. The minor page faults (MINFLT) are references to virtual pages that can be made resident without loading the page from disk: pages that need to be filled with zeroes (e.g. for malloc's)
or pages that were "accidentally" still available in the free page pool.

The disk is still very busy retrieving the virtual pages that are referenced again and need to be swapped in (swin is 7384 which corresponds to read for logical volume lvswap). Therefore the DSK and some LVM lines are shown in red. The physical disk sda is mainly busy due to the requests of logical volume lvswap. However this also slows down the requests issued for the other logical volumes. One request to lvtmp even takes 738ms!

Case study: Life's almost good again...

More than seven minutes later, we can see that the system is almost tranquil again:

There is far less disk I/O and certainly not all disk I/O is related to swapping any more. Processes (like upload) still do not have all their resident memory back, because they simply haven't touched all of their virtual pages since the "storm" has passed. Probably many of these pages have been used during their initialization phase and will not even be referenced any more. As such, a small breeze might help to clean up a dusty memory; however, a stormy leaker such as lekker can better be avoided.....

Possible solutions for memory leakage

From the case study, it is clear that only one misbehaving process can cause a heavy performance degradation for the entire system. The most obvious solution is to solve the memory leakage in the guilty program and take care that every malloc'ed area is sooner or later freed again. However, in practice this might not be a trivial task since the leaking program will often be part of a third-party application.

Suppose that a real solution is not possible (for the time being), it should be possible to avoid that the leaking process is bothering other processes. Preferably it should only harm its own performance by limiting the resident memory that the leaking process is allowed to consume. So not allowing the balloon to expand unlimited (pushing out the others),
but putting a bowl around it, redirecting the superfluous expansions outside the box....

The good news is: there is a standard ulimit value to limit the resident memory of a process.

    $ ulimit -a
    ....
    max memory size         (kbytes, -m) unlimited

The default value is "unlimited". The command ulimit can be used to set a limit on the resident memory consumption of the shell and the processes started by this shell:

    $ ulimit -m 409600
    $ lekker &

The bad news however is: this method only works with kernel version 2.4, but not any more with kernel version 2.6 (dummy value).

But there is other good news (without related bad news this time): in the current 2.6 kernels a new mechanism is introduced called control groups (cgroups). Via cgroups it is possible to partition a set of processes (threads) and specify certain resource limits for such a partition (container). A cgroup can be created for all kinds of resources, also for memory. It is beyond the scope of this document to go into detail about cgroups, but a small example can already illustrate the power of this mechanism.

Cgroups are implemented via a filesystem module, so first of all the virtual cgroup filesystem (with option "memory") should be mounted to an arbitrary directory. This mount has to be done only once after boot, so it's better to specify it in your /etc/fstab file:

    # mkdir /cgroups/memo
    # mount -t cgroup -o memory none /cgroups/memo
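
Since the mount only has to be done once after boot, a script may want to check first whether it is already in place. The following is a minimal sketch, not part of atop: the function name memory_cgroup_mounted is hypothetical, and it assumes the usual layout of /proc/mounts (device, mount point, filesystem type, options).

```shell
# Hypothetical helper: succeed when a cgroup filesystem mounted with the
# "memory" option is listed in a mounts table. The table defaults to
# /proc/mounts; another file can be passed, e.g. for testing.
memory_cgroup_mounted() {
    mounts_file="${1:-/proc/mounts}"
    # Field 3 is the filesystem type, field 4 the comma-separated options.
    awk '$3 == "cgroup" && $4 ~ /(^|,)memory(,|$)/ { found = 1 }
         END { exit (found ? 0 : 1) }' "$mounts_file"
}
```

As root, the mount command shown above could then be guarded with: memory_cgroup_mounted || mount -t cgroup -o memory none /cgroups/memo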

To define a new memory cgroup for the leaking process(es):

1. Create a subdirectory below the mount point of the virtual cgroup filesystem:

    # mkdir /cgroups/memo/leakers

   At the moment that you create a subdirectory, it is magically filled with all kinds of pseudo files and subdirectories that can be used to control the properties of this cgroup.

2. One of the "files" in the newly created subdirectory is called memory.limit_in_bytes and can be used to set the total memory limit for all processes that will run in this cgroup:

    # echo 420M > /cgroups/memo/leakers/memory.limit_in_bytes

3. Another "file" in the newly created directory is called tasks and can be used to specify the ids of the processes/threads that must be part of the cgroup. If you assign a process to a cgroup, also its descendants (started from then on) will be part of that cgroup. Suppose that the leaking process lekker runs with PID 2627, it can be assigned to the cgroup leakers as follows:

    # echo 2627 > /cgroups/memo/leakers/tasks

Now the leaking process cannot use more resident memory than 420MiB. When it runs, atop might show the following output:

The line labeled MEM shows that 1.9GiB memory is free. On the other hand, the line labeled PAG shows that a lot of pages have been swapped out. The process lekker has already grown to 3.6GiB virtual memory (VSIZE), but it only uses 374MiB resident memory (RSIZE). During the last sample, the process has even grown 132MiB virtually (VGROW), but it has shrunk 26MiB physically (RGROW). And what's more important, the other processes are not harmed any more by the leaking process. Their resident growth is not negative.

The leakage is not fixed, though under control...
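
The three steps above can be collected in one small shell function. This is only a sketch under stated assumptions: the function name confine_to_memory_cgroup and its argument order are made up for illustration, and it relies solely on the pseudo files named in the text (memory.limit_in_bytes and tasks). It has to run as root against a mounted memory cgroup filesystem to have any real effect.

```shell
# Hypothetical helper: create a memory cgroup, set its limit and move one
# process into it.
# Arguments: <cgroup mount point> <group name> <limit> <pid>
confine_to_memory_cgroup() {
    cg_dir="$1/$2"
    limit="$3"
    pid="$4"

    # Step 1: create the group (the kernel fills it with pseudo files).
    mkdir -p "$cg_dir" || return 1
    # Step 2: set the total memory limit for all processes in the group.
    echo "$limit" > "$cg_dir/memory.limit_in_bytes" || return 1
    # Step 3: assign the process; descendants started later follow it.
    echo "$pid" > "$cg_dir/tasks" || return 1
}
```

With the values used in this document, the call would be: confine_to_memory_cgroup /cgroups/memo leakers 420M 2627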