<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-8812827877261189081</id><updated>2012-02-16T11:10:41.034-08:00</updated><category term='Apache Harmony'/><category term='Garbage collection'/><category term='JVM'/><category term='Google Android'/><category term='Programming'/><category term='Multi-core'/><title type='text'>Xiao-Feng Li</title><subtitle type='html'>On runtime technology and programming languages</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>60</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-1884557326441945951</id><published>2011-11-23T18:24:00.000-08:00</published><updated>2011-11-23T18:24:39.421-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Google Android'/><title type='text'>Quantify and optimize the user experience of Google 
Android</title><content type='html'>Traditional performance evaluation methodology is quite mature and fairly sophisticated, with hardware support in the micro-architecture and lots of tools available. But the evaluation methodology for user experience is not yet well established. The root causes are two-fold:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Before mobile clients became prevalent recently, the user experience of a computing system was basically just GUI + performance. When it came to user experience, people thought mostly about human-computer interfaces with various media approaches such as speech, virtual reality, etc. that never became mainstream or an actual reality.&lt;br /&gt;&lt;br /&gt;&lt;li&gt;Traditional user experience research has mostly focused on users' subjective perception. So academia on this front spent its efforts on perceptual models, eye-tracking, polls, sweating, etc. that are not directly useful to software engineering.  &lt;br /&gt;&lt;/ol&gt;The community did not realize that there is already enough room for the industry to software-engineer the user experience, based on the established industry principles of responsiveness, smoothness, coherence, accuracy/fuzziness, etc. I have recently prepared a series of slide decks on the software engineering methodology of "user experience optimization" with Google Android: &lt;ul&gt;&lt;li&gt; The overall methodology to &lt;a href="http://people.apache.org/~xli/presentations/Android-user-experience-optimization-external.pdf"&gt;quantify and optimize User Interactions with Android devices&lt;/a&gt;.&lt;br /&gt;&lt;li&gt; The &lt;a href="http://people.apache.org/~xli/presentations/Android-workload-suite-external.pdf"&gt;Android Workload Suite&lt;/a&gt; used to evaluate Android user interactions.&lt;br /&gt;&lt;li&gt; The &lt;a href="http://people.apache.org/~xli/presentations/Android-UXtune-toolkit-external.pdf"&gt;Android UXtune toolkit&lt;/a&gt; used to assist the analysis and 
optimization of Android user interactions.&lt;br /&gt;&lt;/ul&gt;The content is available in my homepage at Apache: &lt;a href="http://people.apache.org/~xli/"&gt;http://people.apache.org/~xli/&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-1884557326441945951?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/1884557326441945951/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=1884557326441945951' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/1884557326441945951'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/1884557326441945951'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2011/11/quantify-and-optimize-user-experience.html' title='Quantify and optimize the user experience of Google Android'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-6845307890202732179</id><published>2009-08-04T22:26:00.000-07:00</published><updated>2009-08-04T22:38:45.450-07:00</updated><title type='text'>Transactional memory regarded not good programming model</title><content type='html'>After I posted my blog entry &lt;a href="http://xiao-feng.blogspot.com/2007/04/can-transactional-memory-be-new.html"&gt;Can "transactional memory" be a new programming model?&lt;/a&gt; two years ago, I have 
been closely watching the opinions from academia on TM as a programming model. Recently I read two papers on the topic: one from Boehm [2] arguing that TM should only be an implementation technique, the other from C. J. Rossbach et al. [3] demonstrating with experiments that programming with TM is not necessarily easier than with coarse-grained locks.&lt;br /&gt;&lt;br /&gt;[1] Can "transactional memory" be a new programming model? http://xiao-feng.blogspot.com/2007/04/can-transactional-memory-be-new.html&lt;br /&gt;[2] Hans-J. Boehm, Transactional Memory Should Be an Implementation Technique, Not a Programming Interface, HotPar, Berkeley, CA, March 30, 2009&lt;br /&gt;[3] Christopher J. Rossbach, Owen S. Hofmann, and Emmett Witchel, Is Transactional Programming Actually Easier? 8th Annual Workshop on Duplicating, Deconstructing, and Debunking (WDDD), Austin, Texas, June 2009&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-6845307890202732179?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/6845307890202732179/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=6845307890202732179' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/6845307890202732179'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/6845307890202732179'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2009/08/transactional-memory-regarded-not-good.html' title='Transactional memory regarded not good programming model'/><author><name>Xiao-Feng 
Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-5373370774170171332</id><published>2009-04-14T19:15:00.000-07:00</published><updated>2009-04-14T19:38:52.984-07:00</updated><title type='text'>A quick guide on Tick - the Harmony concurrent GC</title><content type='html'>I've written a &lt;a href="http://people.apache.org/~xli/presentations/harmony_tick_concurrent_gc.pdf"&gt; slide deck on Tick &lt;/a&gt; [1], the Harmony concurrent GC we developed. Tick has been in Harmony for about one year; now I have found some time to write down its design and implementation. I do not expect people to immediately understand all the internals of Tick after reading the guide, but it should help those who want to dive into Tick or those who want to write their own concurrent GC. The immediate target of this document is to help the GSoC 2009 project with Tick.&lt;br /&gt;&lt;br /&gt;Concurrent GC is a superset of stop-the-world GC, in my opinion. It faces all the challenges of an STW GC and many beyond. The key challenges in my mind (based on my experience with Tick) are: &lt;br /&gt;&lt;br /&gt;1. The interaction between mutators and collectors. In my slides I refer to this as the phase transition control. The idea was not so clear at the beginning of Tick development, but we then realized it simply boils down to a central state-machine control by the mutator. &lt;br /&gt;&lt;br /&gt;2. The termination control. It is easy to understand that to terminate the marking process, we need to guarantee that the global root set, collector-local mark stacks, mutator-local remember sets, and global remember set are all empty. But there are two subtleties in real implementations. 
a) the checks must be done in order; b) the mutator-local remset is accessed concurrently by the mutator and the collector.&lt;br /&gt;&lt;br /&gt;3. The collection triggering scheduler. It must not collect too early, wasting system resources when there is lots of free memory; nor must it collect too late, becoming virtually an STW collection and defeating the whole design target of Tick. A proper triggering scheduler should consider both space and timing issues.&lt;br /&gt;&lt;br /&gt;4. Keep collection overhead small. Concurrent GC aims for short pause times, at the expense of lower overall system throughput. It is a challenge to balance the design between pause time and system throughput. For example, to improve performance, multiple collectors can be deployed in parallel.&lt;br /&gt;&lt;br /&gt;I would like to spend more time documenting the internals of Tick. Stay tuned.&lt;br /&gt;&lt;br /&gt;[1] http://people.apache.org/~xli/presentations/harmony_tick_concurrent_gc.pdf&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-5373370774170171332?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/5373370774170171332/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=5373370774170171332' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/5373370774170171332'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/5373370774170171332'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2009/04/quick-guide-on-tick-harmony-concurrent.html' title='A quick guide on Tick - the Harmony concurrent GC'/><author><name>Xiao-Feng 
Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-7110965773839990682</id><published>2009-04-01T05:25:00.000-07:00</published><updated>2009-04-01T05:30:10.972-07:00</updated><title type='text'>Two Harmony project proposals for GSoC 2009</title><content type='html'>Check link below [1] for application process. For Harmony projects, it is required to discuss the proposal in the mailing list dev@harmony.apache.org .&lt;br /&gt;&lt;br /&gt;&lt;B&gt;1. Modularize Harmony JIT by separating JET as a standalone JIT compiler&lt;/B&gt;&lt;br /&gt;&lt;br /&gt;So far the JIT component (called Jitrino) of Harmony has virtually two JIT implementations: JET and OPT. Jitrino.JET is a fast but non-optimizing JIT, and Jitrino.OPT is an optimizing JIT. The code base of JET and OPT shares lots of code hence they are mixed in one module. This is undesirable for situations where people need only JET, for fast compilation, for small footprint. This project proposes to create a standalone JET-based JIT module for Harmony. It does not require to remove JET from Jitrino, but to create a new JIT module with JET. This project is also a very good exercise to examine the JIT modularity design, the interface between JIT and other components, the interaction between multiple co-existing JIT modules.&lt;br /&gt;&lt;br /&gt;&lt;B&gt;2. Implement WeakReference support in Harmony concurrent GC&lt;/B&gt;&lt;br /&gt;&lt;br /&gt;Harmony already has a concurrent GC (called Tick, with three concurrent GC algorithms). It runs well with standard benchmarks. The only remaining unfinished feature is WeakReference support . Weakly referenced object (i.e., referent) is accessed through get() interface. 
That means, get() operation can make a weakly reachable referent strongly reachable. During concurrent collection, the system must monitor the get() operation to catch this change of reachability, otherwise the referent could be reclaimed. This project also includes to integrate the WeakReference processing with Finalization process. Other optimizations in Tick are also desirable, such as to reduce the amount of floating garbage.&lt;br /&gt;&lt;br /&gt;[1] http://wiki.apache.org/general/SummerOfCode2009&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-7110965773839990682?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/7110965773839990682/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=7110965773839990682' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/7110965773839990682'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/7110965773839990682'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2009/04/two-harmony-project-proposals-for-gsoc.html' title='Two Harmony project proposals for GSoC 2009'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-2006575849610037875</id><published>2008-08-23T06:11:00.000-07:00</published><updated>2008-08-24T19:21:52.050-07:00</updated><title 
type='text'>Thread mapping: 1:1 vs M:N</title><content type='html'>This is an old essay I wrote about 7 years ago, when I was investigating what kind of thread mapping is better for a runtime system (especially the ORP [1] JVM I was working on). It is worth putting here as one of this series of blog articles. I edited it a little to incorporate some new information.&lt;br /&gt;&lt;br /&gt;I discussed the concept of thread mapping, 1:1 or M:N, in the blog entry "What is a thread?" [2]. M:N thread binding is "conceivably" better than 1:1 because of its cheap context switch. The switch is believed to be cheap because a user-level thread switch does not need to involve OS kernel operations. So M:N threading normally has the following setting: there are as many kernel threads as processors, and an unlimited number of user threads are multiplexed over the kernel threads. Since kernel context switches are expensive, it is expected that there are no kernel thread context switches at all except for OS kernel process scheduling. All the context switches are expected to happen only in the user-level scheduler. By comparison, in 1:1 threading all the context switches happen in the kernel.&lt;br /&gt;&lt;br /&gt;&lt;B&gt;&lt;font size=+1&gt;1. Context Switch&lt;/font&gt;&lt;/B&gt;&lt;br /&gt;But this conception is not necessarily true. A kernel thread context switch involves no more operations than a user thread context switch, except that a kernel switch requires trapping into the kernel. So the only additional overhead is the system call. But if we study the real nature of context switches, we can see the truth is not so obvious.&lt;br /&gt;&lt;br /&gt;&lt;B&gt;Scenario 1: Blocking operations&lt;/B&gt; &lt;br /&gt;Context switches mostly happen when a thread is doing some I/O operation and blocks in the kernel. Since it blocks in the kernel, it is natural for the kernel to schedule on-site and context switch to another kernel thread; this is the approach of 1:1 binding. The trapping overhead is there already. 
&lt;br /&gt;&lt;br /&gt;In M:N mapping, there are two ways to continue with another user thread in the same kernel thread context:&lt;br /&gt;&lt;br /&gt;1. Scheduler activation implemented in the kernel. When a thread blocks, the kernel can vector the event to the user scheduler via an upcall. Then the user scheduler can determine which user thread to schedule next. But this is cumbersome because it requires one upcall, which is extra overhead compared to kernel scheduling; and the user scheduler is not necessarily as efficient as the OS kernel scheduler, being less mature.&lt;br /&gt;&lt;br /&gt;2. Non-blocking system call. If the OS doesn't provide scheduler activation, the only way to continue a user thread when another blocks is to use non-blocking system calls. That is, whenever a thread is about to make a possibly blocking syscall, it first polls to check whether the call is going to block. If it is, the user scheduler schedules another thread and polls again in the next scheduling cycle. This solution seems OK, but the polling itself is a system call, which involves everything in a blocking system call except the context switch. So user-level scheduling consists of a non-blocking system call + a user thread context switch. The cost is equal to a blocking system call + a kernel context switch. &lt;br /&gt;&lt;br /&gt;Although it is no better, this is the ideal case of user-level scheduling. In common cases, the non-blocking simulation of a blocking syscall involves much more than stated here. It usually requires a syscall to save the current blocking status of the file descriptor and another syscall to restore it, before and after the polling respectively. And the next scheduling cycle may repeat these operations to check whether the blocking condition is resolved. By comparison, kernel scheduling is much simpler and cleaner: the blocking thread is simply put to sleep, and then woken up when the blocking condition is known to be resolved. 
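&lt;br /&gt;&lt;br /&gt;The extra work of this non-blocking simulation can be sketched in a few lines of Python. This is only an illustration on a POSIX pipe, not code from any threading library; the helper name try_read is hypothetical, and the exact system calls behind os.get_blocking and os.set_blocking depend on the platform.&lt;pre&gt;
```python
import os


def try_read(fd, nbytes):
    """Hypothetical helper: read without blocking, or return None if the read would block."""
    was_blocking = os.get_blocking(fd)     # save the descriptor's current blocking mode
    os.set_blocking(fd, False)             # switch the descriptor to non-blocking mode
    try:
        return os.read(fd, nbytes)         # the actual attempt; raises if no data is ready
    except BlockingIOError:
        return None                        # a user scheduler would now run another user thread
    finally:
        os.set_blocking(fd, was_blocking)  # restore the saved mode


read_fd, write_fd = os.pipe()
first_attempt = try_read(read_fd, 16)      # nothing has been written yet
os.write(write_fd, b"ready")
second_attempt = try_read(read_fd, 16)     # data is available now
```
&lt;/pre&gt;Every attempt pays for saving, changing, and restoring the descriptor mode in addition to the read itself, and a user scheduler has to repeat the attempt in each scheduling cycle until data arrives.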
&lt;br /&gt;&lt;br /&gt;The reason for this complexity of non-blocking syscall simulation is easy to understand: the user scheduler knows nothing about kernel status. It has to use syscalls to figure out the blocking status. Better asynchronous I/O support can help to solve the problem, where the user scheduler is notified by the kernel that the request is being processed asynchronously, and the kernel sends a completion notification once the processing is finished. This kind of interaction between user threads and the kernel is implemented as I/O Completion Ports in Windows NT 4.0, Solaris 10, and their later versions. &lt;br /&gt;&lt;br /&gt;&lt;B&gt;Scenario 2: Synchronization&lt;/B&gt; &lt;br /&gt;Another scenario of thread blocking is caused by synchronization among threads. This is the case where all the scheduling can be done at user level, since the synchronization implementation can live entirely in the user runtime. If the application uses synchronization extensively and the contention is intense, user-level scheduling can bring some performance advantage. But again, it is not that simple.&lt;br /&gt;&lt;br /&gt;1. Normally, in the application domain, a highly contended application is unusual and considered sub-optimal in its design. For example, the highly contended SPECpbob was replaced by the rarely contended SPECjbb2000. (Note that lots of synchronization operations do not mean lots of contention. They are two completely different issues.) Actually, some sources say that thread blocking caused by synchronization accounts for only a very small fraction of all blocking (&lt; 5%).&lt;br /&gt;&lt;br /&gt;2. Even if it is good to use user-level scheduling for thread synchronization blocking, a threading library has to deal with process-level mutexes, which makes the thread synchronization implementation complicated and less efficient than expected. (Futexes helped here.) 
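&lt;br /&gt;&lt;br /&gt;The note in point 1, that many synchronization operations do not imply much contention, can be illustrated with a small Python sketch. The helper name count_contended is hypothetical, and the single-threaded run is an assumption chosen to make the point in its simplest form.&lt;pre&gt;
```python
import threading


def count_contended(lock, operations):
    """Hypothetical helper: lock and unlock repeatedly, counting acquisitions that had to wait."""
    contended = 0
    for _ in range(operations):
        # Try without blocking first; a failure means another thread holds the lock.
        if not lock.acquire(blocking=False):
            contended += 1
            lock.acquire()  # fall back to a blocking acquire
        lock.release()
    return contended


lock = threading.Lock()
# 100000 synchronization operations from a single thread: none of them is contended.
uncontended_result = count_contended(lock, 100000)
```
&lt;/pre&gt;Only the acquisitions counted as contended would ever block a thread, so only those give a user-level scheduler anything to do.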
&lt;br /&gt;&lt;br /&gt;Synchronization is the scenario that supports the idea of M:N mapping; still, the benefits are not convincing enough.&lt;br /&gt;&lt;br /&gt;&lt;B&gt;Scenario 3: Uncaught exceptions&lt;/B&gt;&lt;br /&gt;There are lots of situations where thread blocking is hard to simulate with non-blocking operations, e.g. page faults. A page fault happens in the kernel and is hard to predict. Neither scheduler activation nor a non-blocking wrapper can work around it easily; the thread simply blocks in the kernel. This limits the achievable concurrency.&lt;br /&gt;&lt;br /&gt;As stated earlier, the problem is due to the limited information known at user level. Lots of things like processor affinity, intelligent scheduling for hyperthreads or NUMA optimizations, scheduling based on system load outside of the process, etc. are almost impossible. Scheduler activation can help but is not a complete solution.&lt;br /&gt;&lt;br /&gt;Moreover, all of this scheduling support already exists in the kernel scheduler. To achieve the same scheduling capability at user level, that support has to be duplicated at user level. &lt;br /&gt;&lt;br /&gt;&lt;B&gt;&lt;font size=+1&gt;2. Cooperative user threads&lt;/font&gt;&lt;/B&gt;&lt;br /&gt;Even if user-level scheduling is effective, is it really a good thing? People may think that a user scheduler is better because the user threads are cooperative, and "cooperative" sounds like a better behavior pattern than preemptive.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Number of scheduling units&lt;/b&gt;&lt;br /&gt;It is easy to believe that the best performance is normally achieved when the number of scheduling units (kernel threads) equals the number of processors. User threads sharing a kernel thread are bound to a processor and do not need preemption. But the problem is that, in today's operating systems, there are always many processes running at the same time. 
So even if user threads are cooperative, the kernel thread containing them is already preempted from time to time.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Thread cooperation&lt;/b&gt;&lt;br /&gt;It could really be a good thing if user threads were truly cooperative without much extra cost. Unfortunately, since a non-preemptive scheduler depends on application behavior for scheduling, it normally requires the application to be aware of the scheduling and to yield voluntarily; otherwise threads can easily deadlock or starve. But this awareness is burdensome to the developer and usually does not exist at all. The threading runtime cannot always insert yields properly. &lt;br /&gt;&lt;br /&gt;Contrary to many people's expectations, preemptive scheduling is more intuitive to application developers than cooperative scheduling. (The situation is different from the stop-the-world mechanism for GC in a runtime system. I will discuss it later.)&lt;br /&gt;&lt;br /&gt;More importantly, preemption might be necessary for any environment where real-time or soft real-time responsiveness is needed, both for rogue-thread cases and for cases where the processor has to be given up because a higher-priority thread needs it.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Resource sharing&lt;/b&gt;&lt;br /&gt;An advantage of user threads in one kernel thread might be that the threads can share all system resources, especially the page table and TLB slots, which could lead to better performance due to fewer misses. Well, the fact is that kernel threads can share these resources too. (This feature needs architectural support. AFAIK, Linux does pretty well on IA32 processors.)  &lt;br /&gt;&lt;br /&gt;&lt;b&gt;Thread creation/cancellation&lt;/b&gt;&lt;br /&gt;One important claimed advantage of user threads is that their creation and cancellation cost is low. This is mostly true if the thread runtime is implemented correctly. 
But its importance is largely reduced by the nature of user threads. Since a user thread is not a separate system scheduling unit, it is unclear what the point is of creating tons of user threads that only run within one kernel thread container. They will compete for the precious time slice of the single kernel thread. &lt;br /&gt;&lt;br /&gt;In reality, most systems do not create or cancel lots of threads in a short time. In cases when lots of threads are needed, applications may want to reduce the creation overhead with thread pooling. Creating lots of threads doesn't necessarily mean lots of concurrency. They may just start and finish frequently. And even if there is lots of concurrency, it is hard to leverage because exploiting it depends on scheduling efficiency, which is not so good for user-level threading, as we discussed above. &lt;br /&gt;&lt;br /&gt;&lt;B&gt;&lt;font size=+1&gt;3. Obvious disadvantages&lt;/font&gt;&lt;/B&gt;&lt;br /&gt;In spite of the dubious "advantages" of M:N threading discussed above, a user-level scheduler has some obvious disadvantages.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Duplication of the schedulers&lt;/b&gt;&lt;br /&gt;M:N requires two schedulers that do basically the same work, one at user level and one in the kernel. This is undesirable. It requires frequent data communication between kernel and user space to transfer scheduling information. &lt;br /&gt;&lt;br /&gt;One subtler point is that the duplication takes more space in both the Dcache and Icache for scheduling than a single scheduler. It is highly undesirable if cache misses are caused by the schedulers rather than the application, because an L2 cache miss could be more expensive than a kernel thread switch. Then the additional scheduler might become a troublemaker! 
In this case, saving kernel traps does not justify a user scheduler, which is all the more true as processors provide faster and faster kernel trap execution.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Thread local data maintenance&lt;/b&gt; &lt;br /&gt;M:N has to maintain thread-specific data that the kernel already provides for kernel threads, such as TLS data and the error number. Providing the same feature for user threads is not straightforward because, for example, the error number is returned on system call failure and supported by the kernel. User-level support degrades system performance and increases system complexity.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;System info oblivious&lt;/b&gt; &lt;br /&gt;The kernel scheduler is close to the underlying platform and architecture. It can take advantage of their features. This is difficult for a user thread library because it is a layer at user level. User threads are second-order entities in the system. If a kernel thread uses a GDT slot for TLS data, a user thread perhaps can only use an LDT slot for TLS data. With more and more support for threading/scheduling available from new processors (Hyperthreading, NUMA, many-core), the second-order nature seriously limits the ability of M:N threading. &lt;br /&gt;&lt;br /&gt;These are my thoughts on the thread mapping issues. It is a long article; thanks for reading this far. There is some content in the original essay that I do not include here: some considerations on the threading needs of a runtime system, such as whether we need a suspend/resume API, better inter-thread signaling, etc. I might discuss them in the future.&lt;br /&gt;&lt;br /&gt;[1] Open Runtime Platform, http://orp.sourceforge.net &lt;br /&gt;[2] What is a thread? 
http://xiao-feng.blogspot.com/2008/08/what-is-thread.html&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-2006575849610037875?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/2006575849610037875/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=2006575849610037875' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/2006575849610037875'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/2006575849610037875'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2008/08/thread-mapping-11-vs-mn.html' title='Thread mapping: 1:1 vs M:N'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-8680950075758825200</id><published>2008-08-17T06:39:00.000-07:00</published><updated>2008-08-21T07:29:17.733-07:00</updated><title type='text'>What is thread escape analysis?</title><content type='html'>In last blog entry, I discussed "thread local data" [1]. I mentioned that if some data are known to be thread local, some optimizations can be applied to the application manually or automatically.&lt;br /&gt;&lt;br /&gt;One optimization is, we can put the thread local data into registers. Register is thread local anyway, so the semantics are kept. 
Register access is much faster than memory access, so this optimization can improve performance. In a compiler, this is done with register allocation, scalar replacement (i.e., using scalars to replace the fields of non-scalar data such as arrays or class instances), etc. &lt;br /&gt;&lt;br /&gt;Thread local data can be put on the runtime stack as well. The stack is in memory just like the heap, but it has the advantage that data on it are freed automatically when the stack frame is cleared (when a method returns). In this case, the data are not only thread local but also method local, which is a stricter condition than thread local. (Some people claim that stack-allocated data have better access locality, but my experience did not confirm that.)&lt;br /&gt;&lt;br /&gt;Even if the thread local data cannot be put into registers or on the stack, there are still applicable optimizations. For example, garbage in thread local data can be recycled without stopping other threads for root enumeration. (Well, some technique is still needed to enumerate roots in global variables.)&lt;br /&gt;&lt;br /&gt;No matter where they are, all thread local data offer an important optimization opportunity: they do not need any locking operations for mutually exclusive access. This is important because locking operations are usually very expensive. &lt;br /&gt;&lt;br /&gt;Then the question is how the compiler can automatically identify that some data are thread local. This is called escape analysis. The compiler analyzes the source code to find whether a reference to an object is passed to another thread. This analysis has to be conservative to guarantee correctness. For example, when a reference is written to a global variable, it is usually considered escaping, because the global variable could be read by another thread.&lt;br /&gt;&lt;br /&gt;Escape analysis can also be conducted at runtime without static compiler analysis. That is, when an object is created, it is set thread local to its creating thread. 
Then the system monitors all the accesses to this object. If it detects that any other thread tries to access the object, the system marks the object as escaping. This technique is sometimes called escape detection instead of escape analysis. Some of the thread local data optimizations can still be applied to the dynamically detected thread local data, but some static optimizations might not be suitable.&lt;br /&gt;&lt;br /&gt;"Thread local" is actually not necessarily restricted to "accesses". Many operations can have the "thread local" property. For example, if an object is only locked by one thread, its lock is called a thread local lock. A thread local lock can be eliminated even if the object itself is accessed by multiple threads. So thread local data is more restrictive than thread local object. &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;[1] What is thread local data? http://xiao-feng.blogspot.com/2008/08/what-is-thread-local-data.html&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-8680950075758825200?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/8680950075758825200/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=8680950075758825200' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/8680950075758825200'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/8680950075758825200'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2008/08/what-is-thread-escape-analysis.html' title='What is thread escape analysis?'/><author><name>Xiao-Feng 
Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-4678124290075428263</id><published>2008-08-17T03:03:00.000-07:00</published><updated>2008-08-17T06:33:36.732-07:00</updated><title type='text'>What is thread local data?</title><content type='html'>Thread local data refer to those data owned by one thread. Normally "owned" here means the data are only accessed by that thread. Sometimes they are also called thread-private data, thread-specific data, thread-local storage, etc. Thread local data are interesting because the property of being owned by a single thread can be utilized to improve the application's performance or the program design.&lt;br /&gt;&lt;br /&gt;There are basically three kinds of thread local data. They are:&lt;ul&gt;&lt;br /&gt;&lt;li&gt; Registers. Registers can normally be accessed only by a single thread (or process, depending on the context). (Yes, some processors have global registers, and even the common registers can be accessed with tricks, but those are out of the scope of my discussion here.)&lt;br /&gt;&lt;li&gt; Stack. This is known to be associated with a specific thread. Sometimes a thread is identified by its stack.&lt;br /&gt;&lt;li&gt; Thread-local heap. Within the shared heap space, a region can be owned by a single thread.&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;Registers and stack are system-supported thread local data, as we discussed in "What is a thread?" [1]. They cannot be accessed by other threads by design. &lt;br /&gt;&lt;br /&gt;Thread-local heap is different. It is supported not by design, but by convention, because the heap is shareable by all threads. 
A heap region being local to one thread means either of the following two situations:&lt;br /&gt;1. The region is not accessible to other threads. The region can be protected by a virtual memory mechanism or some other technique to enforce the convention, or it is simply a rule obeyed by all the threads.&lt;br /&gt;2. The region is accessible to all threads, but only one thread actually accesses it (up to the moment). This property of the data is called "non-escape", i.e., they are confined to a single thread's territory. Once the data are accessed by another thread, they are said to "escape". (We will discuss "escape" and "escape analysis" later.)&lt;br /&gt;&lt;br /&gt;Unlike the registers or the stack, there is no default system support for thread local heap. Programmers need some way to define, find, and use a thread local heap. Since every thread can claim thread local regions, each thread should be able to find its own regions with the same API, such as my_region(). The solution is to put the region pointer into the same register or the same stack slot in each thread. Since the registers and the stack are system-supported thread local data, different threads access their own registers or stack slots even when they use the same register name or stack slot. So they can get their own region pointers from the same API my_region(). This is how the current thread libraries implement "thread local storage" or "thread specific data". &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;[1] What is a Thread? 
http://xiao-feng.blogspot.com/2008/08/what-is-thread.html&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-4678124290075428263?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/4678124290075428263/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=4678124290075428263' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/4678124290075428263'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/4678124290075428263'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2008/08/what-is-thread-local-data.html' title='What is thread local data?'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-109737427357728913</id><published>2008-08-17T02:22:00.001-07:00</published><updated>2008-08-17T02:22:40.571-07:00</updated><title type='text'>What is a thread?</title><content type='html'>Some of my friends are confused by all kinds of concepts around threads, such as system thread, kernel thread, native thread, user-level thread, application thread, java thread, software thread, hardware thread, simultaneous multithread (SMT), hyperthread (HT), helper thread, etc.  &lt;br /&gt;&lt;br /&gt;So what is a thread?&lt;br /&gt;&lt;br /&gt;A thread is nothing but a control flow of execution. 
It is a concept only valid in a control-flow machine, because only then is there control flow. Then what is control flow? Or, in other words, how do we represent a control flow? In my opinion, only two entities are essential to represent a control flow. They are the program counter and the stack.&lt;br /&gt;&lt;br /&gt;The program counter points to the next instruction to execute. The stack stores temporary execution results. To be a meaningful stack, it needs a stack pointer pointing to the next location to store an execution result.&lt;br /&gt;&lt;br /&gt;The program counter and the stack uniquely identify a control flow of execution. They cannot be shared with other threads (except for some extreme cases). All other computing resources can be shared between threads, such as heap, code, processor, etc. Hence, the program counter and the stack are called the thread context. &lt;br /&gt;&lt;br /&gt;That means, if a system provides threading support, it should at least provide a way to distinguish one thread context from another, be it in software, hardware or a hybrid. If the processor hardware provides thread context support, it is a hardware thread. Different hardware threads can share the same processor pipeline (SMT) or use different pipelines, depending on the design. HT is an implementation of SMT. Any control-flow processor must provide at least one thread context; otherwise, there would be no control flow. &lt;br /&gt;&lt;br /&gt;If the processor has only one thread context, threading support can still be provided by software. That is, multiple software threads can multiplex over the same thread context. When a software thread is scheduled to run, its context is loaded from memory into the hardware thread context. When it is scheduled off the processor, its context is saved into memory.  &lt;br /&gt;&lt;br /&gt;Software thread design has an implication. 
Since the thread context loading/storing (or switch) is conducted by software, the software thread design must guarantee that there are chances to conduct the switch operation (or thread scheduling). An easy way to implement this is to leverage hardware interrupts. Once the software receives a hardware interrupt (timer or whatever), it executes the interrupt handler, and within the handler it schedules the threads. &lt;br /&gt;&lt;br /&gt;Sometimes, the timer interval is too long to wait for. For example, when a thread goes to sleep, no other thread can be scheduled until a timer handler is executed. This is not desirable. A straightforward solution is that a thread that wants to sleep always invokes the scheduler, so that the scheduler can switch to another thread. &lt;br /&gt;&lt;br /&gt;Now that multiple software threads can share the same hardware context, it is not hard to see that a software thread context can also be multiplexed by another level of multiple software threads. This is true. So conceptually, software threads can be built with infinite levels, with the threads at every higher level multiplexing the contexts of the threads at the next level. &lt;br /&gt;&lt;br /&gt;A natural corollary is that a thread is only a thread at your level of discussion. It could contain multiple threads at a higher level of discussion. Well, although this is true, people do not really build many levels of software threading libraries. Usually there are only two levels: one level shares the hardware context, and the other level shares the software context.&lt;br /&gt;&lt;br /&gt;This is reasonable. The most important reason is that all the software threads in one level are treated as a single thread in the next level, so they are scheduled as one thread in the next level. That means they share, in total, only the time slice of a single thread in the next level. If the next-level thread is scheduled off the processor, none of them can continue. This is inconvenient. 
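&lt;br /&gt;&lt;br /&gt;The two-level arrangement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from any real threading library: the names worker and run are hypothetical, generators stand in for user-level threads, and each yield stands for a voluntary call into the user-level scheduler. Everything runs inside one kernel thread, so the workers only ever share that thread's time slice.
&lt;pre&gt;
```python
from collections import deque


def worker(name, steps, log):
    # A user-level thread (hypothetical): each yield hands control
    # back to the user-level scheduler.
    for i in range(steps):
        log.append((name, i))
        yield


def run(threads):
    # Round-robin user-level scheduler (hypothetical) that runs
    # entirely inside a single kernel thread.
    ready = deque(threads)
    while ready:
        t = ready.popleft()
        try:
            next(t)          # resume the user-level thread until its next yield
            ready.append(t)  # still runnable: put it back in the queue
        except StopIteration:
            pass             # this user-level thread has finished


log = []
run([worker("A", 2, log), worker("B", 3, log)])
print(log)
```
&lt;/pre&gt;
The printed log interleaves the steps of A and B, yet at no point do two workers run at the same time.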
&lt;br /&gt;&lt;br /&gt;Even more inconvenient, sometimes only one thread wants to sleep, but all the other threads have to sleep together with it, because they are treated as a single thread by the next-level scheduler, which sees the sleep operation. This issue can be partially solved with non-blocking sleep.  That is, when a thread wants to sleep, it does not really sleep in the sense of the next level scheduler. It only sleeps in the eyes of its own level's scheduler. This scheduler will schedule another thread at the same level. From the next-level thread scheduler's point of view, the thread just continues without sleeping at all. In threading terminology, all the blocking operations (such as sleeping) in one level are implemented as non-blocking in its next level. (Well, this requires system support for non-blocking operations, such as socket snooping, etc.)&lt;br /&gt;&lt;br /&gt;Only the operating system kernel really takes control of the execution engine (i.e., the processor or the pipeline). So the time slice concept is only really meaningful to the kernel. That means only threading at the kernel level can really manipulate all the resources. Higher levels of software threads should always try to leverage the support of kernel threading. This is the fundamental reason why we want at most one additional level of threading above kernel threads. &lt;br /&gt;&lt;br /&gt;Kernel threads are exposed to user applications through threading APIs. They are called native threads by the applications, such as NPTL or Linuxthreads in Linux, and WinThreads in Windows. A threading library implemented on top of native threads is called user-level threading. Because of the inconveniences discussed above, not much software today employs user-level threads. &lt;br /&gt;&lt;br /&gt;User-level threads have their own advantages in certain scenarios. 
For example, multiple user threads never run in parallel on multiple processors/cores, because they are actually just a single thread from the OS's point of view.&lt;br /&gt;&lt;br /&gt;Java thread is thread in another dimension. It is actually a language concept. It can be implemented with any of the threading mechanisms discussed above. Before Java, threading support was largely independent of programming languages. It was just system support, and any language could utilize it if desired. Java takes a different approach: it builds the threading concept into the language. This is important for program semantic correctness. Hans had a PLDI paper with the title "Threads Cannot be Implemented as a Library" [1]. And people are trying to introduce threading as a language construct into more languages.&lt;br /&gt;&lt;br /&gt;[1] Hans Boehm, Threads Cannot be Implemented as a Library, www.hpl.hp.com/techreports/2004/HPL-2004-209.pdf&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-109737427357728913?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/109737427357728913/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=109737427357728913' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/109737427357728913'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/109737427357728913'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2008/08/what-is-thread.html' title='What is a thread?'/><author><name>Xiao-Feng 
Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-1073679165045421030</id><published>2008-04-09T08:09:00.000-07:00</published><updated>2008-04-09T08:12:49.821-07:00</updated><title type='text'>Quick hacking guide for Harmony GC development</title><content type='html'>I just wrote &lt;a href="http://people.apache.org/~xli/presentations/harmony_gc_source.pdf"&gt;an introduction on Harmony GC source code&lt;/a&gt;. It serves as a quick hacking guide. It's subject to changes based on the comments or questions received. Hope this is useful for the students who are applying for GSoC projects related to GC.&lt;br /&gt;&lt;br /&gt;The doc is at http://people.apache.org/~xli/presentations/harmony_gc_source.pdf&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-1073679165045421030?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/1073679165045421030/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=1073679165045421030' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/1073679165045421030'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/1073679165045421030'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2008/04/quick-hacking-guide-for-harmony-gc.html' title='Quick hacking guide for Harmony GC 
development'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-5941266186019116790</id><published>2008-04-04T02:29:00.000-07:00</published><updated>2008-04-04T02:41:17.498-07:00</updated><title type='text'>Parallel Garbage Collection</title><content type='html'>Recently, I gave a talk on "&lt;a href="http://people.apache.org/~xli/presentations/parallel_garbage_collection.pdf"&gt;Parallel Garbage Collection&lt;/a&gt;"[1] (2008-03-28) in &lt;a href="http://gelato.org/etws/program/index.php"&gt;Shanghai Many-Core Workshop 2008&lt;/a&gt;, arranged by gelato.org. &lt;br /&gt;&lt;br /&gt;The presentation discussed some common issues in STW parallel GC algorithms: &lt;ol&gt;&lt;br /&gt;&lt;li&gt;Traversal of object connection graph&lt;br /&gt;&lt;li&gt;Order of object copying&lt;br /&gt;&lt;li&gt;Phases of heap compaction&lt;br /&gt;&lt;li&gt;Marking of live object&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;I also briefly mentioned several threading issues with garbage collection.&lt;ol&gt;&lt;br /&gt;&lt;li&gt;Thread local objects&lt;br /&gt;&lt;li&gt;Finalizer processing&lt;br /&gt;&lt;li&gt;Concurrent collection&lt;br /&gt;&lt;li&gt;GC and transactional memory&lt;br /&gt;&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;[1]http://people.apache.org/~xli/presentations/parallel_garbage_collection.pdf&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-5941266186019116790?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/5941266186019116790/comments/default' title='Post Comments'/><link 
rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=5941266186019116790' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/5941266186019116790'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/5941266186019116790'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2008/04/parallel-garbage-collection.html' title='Parallel Garbage Collection'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-7642446820470856520</id><published>2008-03-23T18:45:00.001-07:00</published><updated>2008-03-23T18:58:17.141-07:00</updated><title type='text'>Harmony project proposals for Google Summer of Code 2008</title><content type='html'>&lt;a href="http://code.google.com/soc/2008"&gt;Google Summer of Code 2008&lt;/a&gt; is open for &lt;a href="http://code.google.com/p/google-summer-of-code/wiki/AdviceforStudents"&gt;student applications&lt;/a&gt;. Harmony has following proposals together with other &lt;a href="http://wiki.apache.org/general/SummerOfCode2008"&gt;ASF projects&lt;/a&gt;:&lt;br /&gt;&lt;br /&gt;======================================&lt;br /&gt;&lt;b&gt;Subject ID: harmony-gc-1&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Title: Implement the "Compressor" GC proposed by Kermany and Petrank&lt;br /&gt;&lt;br /&gt;Keywords: Java, memory management, GC&lt;br /&gt;&lt;br /&gt;Description: The Compressor garbage collector [1] is a compacting GC that leverages virtual memory support in underlying OS. 
It compacts the heap in two passes.&lt;br /&gt;[1] Haim Kermany, Erez Petrank: The Compressor: concurrent, incremental, and parallel compaction. PLDI 2006.&lt;br /&gt;&lt;br /&gt;Possible Mentors: Xiao-Feng Li (xiaofeng.li (a) gmail com)&lt;br /&gt;&lt;br /&gt;Status: Unassigned&lt;br /&gt;&lt;br /&gt;======================================&lt;br /&gt;&lt;b&gt;Subject ID: harmony-gc-2&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Title: Implement the "Mapping Collector" proposed by Wegiel and Krintz&lt;br /&gt;&lt;br /&gt;Keywords: Java, memory management, GC&lt;br /&gt;&lt;br /&gt;Description: The Mapping Collector [1] utilizes the virtual memory support in a novel way so that it can compact the heap without moving the objects or fixing the references.&lt;br /&gt;[1] Michal Wegiel and Chandra Krintz, The Mapping Collector: Virtual Memory Support for Generational, Parallel, and Concurrent Compaction, ASPLOS 2008.&lt;br /&gt;&lt;br /&gt;Possible Mentors: Xiao-Feng Li (xiaofeng.li (a) gmail com)&lt;br /&gt;&lt;br /&gt;Status: Unassigned&lt;br /&gt;&lt;br /&gt;======================================&lt;br /&gt;&lt;b&gt;Subject ID: harmony-gc-3&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Title: Write a graphical front-end for Harmony memory management&lt;br /&gt;&lt;br /&gt;Keywords:Java, memory management, GC&lt;br /&gt;&lt;br /&gt;Description: Harmony runtime needs a graphic front-end visualizing the memory management activities and the runtime status. It can be standalone or better an Eclipse plugin. 
It can be online display of the runtime execution, or offline processing of the log.&lt;br /&gt;&lt;br /&gt;Possible Mentors: Xiao-Feng Li (xiaofeng.li (a) gmail com)&lt;br /&gt;&lt;br /&gt;Status: Unassigned&lt;br /&gt;&lt;br /&gt;======================================&lt;br /&gt;&lt;b&gt;Subject ID: harmony-gc-4&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Title: Unify the native memory management of Harmony DRLVM&lt;br /&gt;&lt;br /&gt;Keywords: Java, virtual machine, memory management, GC&lt;br /&gt;&lt;br /&gt;Description: DRLVM uses inconsistent APIs for native memory management, such as APR or malloc or mmap. It is desirable to have a unified API for native memory management. Hopefully the runtime native memory usage can be managed with a global view and then optimized. This layer could be extended to provide the API for Java heap native management as well.&lt;br /&gt;&lt;br /&gt;Possible Mentors: Xiao-Feng Li (xiaofeng.li (a) gmail com)&lt;br /&gt;Andrey Yakushev (andrey.yakushev (a) gmail com )&lt;br /&gt;&lt;br /&gt;Status: Unassigned&lt;br /&gt;&lt;br /&gt;======================================&lt;br /&gt;&lt;b&gt;Subject ID: harmony-gc-5&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Title: Build a garbage collector for C/C++ programs on the top of Harmony&lt;br /&gt;&lt;br /&gt;Keywords: C, C++, memory management, GC&lt;br /&gt;&lt;br /&gt;Description: One may notice a lack of open source effective parallel GC implementation for C/C++ programs. For example, Parrot (Perl 6) community expressed an [WWW] interest in attaching our GC to their code base. If numbers would show some benefit, we might get other adopters of our code base. 
Successful completion of GC library on the top of Harmony would teach a person in refactoring skills and give a good background in garbage collection.&lt;br /&gt;&lt;br /&gt;Possible Mentors: Alexei Fedotov; Xiao-Feng Li (xiaofeng.li (a) gmail com)&lt;br /&gt;&lt;br /&gt;Status: Unassigned&lt;br /&gt;&lt;br /&gt;======================================&lt;br /&gt;&lt;b&gt;Subject ID: harmony-tools-1&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Title: Implement "Bundle Tool", a tool to make binary snapshots of Java applications with Harmony&lt;br /&gt;&lt;br /&gt;Keywords: Java, tools&lt;br /&gt;&lt;br /&gt;Description: There is no simple way for Harmony users to define the list of needed classes, jars and native libraries for their applications. Create a tool that creates a Harmony package with the classes and native libraries used by specific application or work flow. First of all this application should collect data from one or multiple application runs and then create a Harmony bundle without unneeded classes and native code.&lt;br /&gt;&lt;br /&gt;Possible Mentors: Egor Pasko; Mark Hindess&lt;br /&gt;&lt;br /&gt;Status: Unassigned&lt;br /&gt;&lt;br /&gt;======================================&lt;br /&gt;&lt;b&gt;Subject ID: harmony-tools-2&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Title: Implement a Java developer's command line tool&lt;br /&gt;&lt;br /&gt;Keywords: Java, tools&lt;br /&gt;&lt;br /&gt;Description: Harmony is missing several of the tools that ship with the JDK, including jar, jconsole, javaws and policytool. 
For this task you would implement one of these tools, either in Java or C/C++ if preferred.&lt;br /&gt;&lt;br /&gt;Possible Mentors: Sian January; Mark Hindess&lt;br /&gt;&lt;br /&gt;Status: Unassigned&lt;br /&gt;&lt;br /&gt;======================================&lt;br /&gt;&lt;b&gt;Subject ID: harmony-vm-1&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Title: Support invokedynamic bytecode instruction in Harmony VM and JIT&lt;br /&gt;&lt;br /&gt;Keywords: C++, java, virtual machine, JIT, bytecode, dynamic languages&lt;br /&gt;&lt;br /&gt;Description: Support the [WWW] invokedynamic instruction i.e. the ideas of [WWW] JSR 292 draft. And implement basic support for a dynamic language like Python, Ruby, JavaScript as a proof of concept. We want this language to have dynamic typing, reasonable user base, usable standard library, a set of compatibility tests. Students are free to choose the actual dynamic language. We will discuss the reasoning behind the choice. The code involved is C (VM part) C++ (JIT) and probably some class library part in Java. The task is rather challenging for the summer, hence, will require a lot of interaction with the team on the tricky details. Lots of fun guaranteed.&lt;br /&gt;&lt;br /&gt;Possible Mentors: Egor Pasko&lt;br /&gt;&lt;br /&gt;Status: Unassigned&lt;br /&gt;&lt;br /&gt;======================================&lt;br /&gt;&lt;b&gt;Subject ID: harmony-vm-2&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Title: Integrate Harmony and Jikes RVM&lt;br /&gt;&lt;br /&gt;Keywords: Java, virtual machine, VM&lt;br /&gt;&lt;br /&gt;Description: [WWW] Jikes RVM is a research virtual machine that has been a test bed for many JVM and GC developments. This project will look to integrate the Harmony class libraries with Jikes RVM. 
This will require work on the VM interfaces (the Jikes RVM is a Java-in-Java VM meaning that current VM interfaces are written in Java rather than native code) as well as exploring how Jikes RVM can be integrated with Harmony's threading and other runtime models.&lt;br /&gt;&lt;br /&gt;Possible Mentors: Ian Rogers; Tim Ellison (t.p.ellison (a) gmail com)&lt;br /&gt;&lt;br /&gt;Status: Unassigned&lt;br /&gt;&lt;br /&gt;======================================&lt;br /&gt;&lt;b&gt;Subject ID: harmony-jit-1&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Title: Refactor Java Bytecode Translator in Harmony JIT&lt;br /&gt;&lt;br /&gt;Keywords: C++, JIT, bytecode, compilers, Java&lt;br /&gt;&lt;br /&gt;Description: The optimizing JIT (Jitrino.OPT) parses Java bytecode on early stages of method compilation to produce Internal Representation to allow further stages to optimize it. This code is well-tested, but not easy to extend (for example, not easy to teach JIT to understand new types of instructions) The major inconvenience is that translator makes things too complicated by trying to optimize on the fly. The task is to refactor the Java-Bytecode-Translator in the Jitrino.OPT to make the code cleaner and simplify the data structures used. Move optimization to a separate stage. Take care of correctly mapping line number info from bytecode into JIT instructions. Code is C++, but not a tricky style. 
The student will get an in-depth knowledge of Java bytecode, overall knowledge of just-in-time compilation techniques.&lt;br /&gt;&lt;br /&gt;Possible Mentors: Egor Pasko&lt;br /&gt;&lt;br /&gt;Status: Unassigned&lt;br /&gt;&lt;br /&gt;======================================&lt;br /&gt;&lt;b&gt;Subject ID: harmony-demo-1&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Title: Make [WWW] FreeCol game playable on Harmony&lt;br /&gt;&lt;br /&gt;Keywords: Java, games, graphics&lt;br /&gt;&lt;br /&gt;Description: For someone who is interested in a graphical user interface development, enabling one of the most popular strategic games may be an interesting task. Since client API development is not finished, the one who would choose this task might learn designing of areas related to image processing, code development, bug fixing and refactoring. See for [WWW] details on the current status of the project. BTW, you may want to replace [WWW] FreeCol enabling with enabling of your favorite application, and this is welcome.&lt;br /&gt;&lt;br /&gt;Possible Mentors: Alexei Fedotov&lt;br /&gt;&lt;br /&gt;Status: Unassigned&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-7642446820470856520?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/7642446820470856520/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=7642446820470856520' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/7642446820470856520'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/7642446820470856520'/><link rel='alternate' type='text/html' 
href='http://xiao-feng.blogspot.com/2008/03/harmony-project-proposals-for-google.html' title='Harmony project proposals for Google Summer of Code 2008'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-7438290427605555388</id><published>2008-03-23T17:33:00.000-07:00</published><updated>2009-01-15T02:19:47.187-08:00</updated><title type='text'>Parallel Compacting Garbage Collectors and their phases</title><content type='html'>I discussed several of the best-known sequential compacting GC algorithms [1]. Now I'd like to focus on the phases of parallel compactors.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;1. Parallel LISP2 Sliding Compactor&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;It has four main phases (or heap passes), as shown below.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_4MytEQFYkbU/R-b7VACVjPI/AAAAAAAAAAs/Wk6OwlGXPf4/s1600-h/image002.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://2.bp.blogspot.com/_4MytEQFYkbU/R-b7VACVjPI/AAAAAAAAAAs/Wk6OwlGXPf4/s320/image002.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5181104759541959922" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The key to the parallel LISP2 Compactor is to build a dependence list between a target block and the source blocks whose live objects are copied to that target block. The dependence list is built in the first phase, when the target addresses are computed.&lt;br /&gt;&lt;br /&gt;The LISP2 Compactor stores the target address in the object header. It has to update the references before moving objects. 
That is, the reference-fixing phase has to sit between the object-repointing and object-moving phases.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;2. IBM's Moving Compactor&lt;/b&gt;&lt;br /&gt;It has three heap passes [2]. &lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_4MytEQFYkbU/R-b7hACVjQI/AAAAAAAAAA0/gDkCfvoPiVk/s1600-h/image004.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://2.bp.blogspot.com/_4MytEQFYkbU/R-b7hACVjQI/AAAAAAAAAA0/gDkCfvoPiVk/s320/image004.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5181104965700390146" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The key to the Moving Compactor is that it uses an &lt;b&gt;offset table&lt;/b&gt;, which stores the target addresses of the live objects. This prevents a moved object from overwriting target-address information, so the references in the heap can be updated at any time (in any phase). Here the references are fixed after the object-moving phase. &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;A note added on 2009-1-15: Recently (2009-1-12), one of the authors of this compactor, Diab Abuaiadh, told me the following in his email: "It is simple to convert the algorithm to a one (heap) pass compaction by delaying moving the objects from first phase to second phase. This involves minor changes to the original algorithm. This change reduces the pause time while it has no negative impact on the locality of reference". Diab did not disclose the details of the "minor changes" required. It is unclear how similar the result is to the STW Compactor of Compressor described below.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;3. 
STW Compactor of Compressor&lt;/b&gt;&lt;br /&gt;The stop-the-world compactor of Compressor [3] has 2 heap passes and 1 mark-bit table pass, so it can be counted as having 2.5 phases if we consider the mark-table traversal much more lightweight than a heap pass.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_4MytEQFYkbU/R-b7ngCVjRI/AAAAAAAAAA8/Lpl23yfz5Sk/s1600-h/image006.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://4.bp.blogspot.com/_4MytEQFYkbU/R-b7ngCVjRI/AAAAAAAAAA8/Lpl23yfz5Sk/s320/image006.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5181105077369539858" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The key to Compressor is that it computes object target addresses without traversing the heap; instead, it goes through a separate &lt;b&gt;mark table&lt;/b&gt;, which encodes the location and &lt;b&gt;size&lt;/b&gt; of live objects. &lt;br /&gt;&lt;br /&gt;Compressor combines the object-moving and reference-fixing phases into one. This is impossible in IBM's Moving Compactor, because there the target addresses are not yet known for all live objects during object moving.&lt;br /&gt;&lt;br /&gt;Sliding Compactor and Moving Compactor do not need the separate mark-bit table.&lt;br /&gt;&lt;br /&gt;Compressor improves on both by combining ideas from the Sliding Compactor and the Moving Compactor. Compared to the Sliding Compactor, Compressor uses the mark-bit table to contract the object-repointing phase into a mark-bit table traversal. It writes the target addresses into an offset table, which lets it combine the object-moving and reference-fixing phases into one. So algorithm-wise, Compressor is similar to the Sliding Compactor -- but optimized with two additional data structures (mark table and offset table).&lt;br /&gt;&lt;br /&gt;&lt;b&gt;4. Mapping Collector&lt;/b&gt;&lt;br /&gt;The Mapping Collector optimizes the phases further [4]. It does not move objects, hence no reference-fixing either. 
The idea is simply to unmap the pages that hold only garbage. These pages can be found by traversing the mark-bit table. So it has one heap pass for marking plus a mark-bit table pass, and might be considered to have 1.5 phases.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_4MytEQFYkbU/R-b7uQCVjSI/AAAAAAAAABE/W4hDxcFm4-4/s1600-h/image008.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://3.bp.blogspot.com/_4MytEQFYkbU/R-b7uQCVjSI/AAAAAAAAABE/W4hDxcFm4-4/s320/image008.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5181105193333656866" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The key to the Mapping Collector is that it leverages the OS's virtual memory support. It builds on an important observation: an object address is a virtual address, so moving an object only changes its virtual address; the address is still virtual. Since the goal of garbage collection is to free up physical memory, it is good enough if we can free the physical memory occupied by the garbage. The only issue is that we must be able to expand the virtual space at one end of the heap to obtain a large contiguous free area. The collector then attains the two goals of a compacting GC: &lt;br /&gt;a) to squeeze the garbage space out of the heap; the Mapping Collector does this by unmapping the pages holding garbage.&lt;br /&gt;b) to get a contiguous free area; the Mapping Collector does this by mapping pages at one end of the heap.&lt;br /&gt;&lt;br /&gt;The blog entry stops here. One last remark: the number of phases of a compactor does not translate linearly into its performance. Other factors besides the phases, such as space overhead, affect the overall performance achieved.&lt;br /&gt;&lt;br /&gt;[1] http://xiao-feng.blogspot.com/2007/04/sequential-compacting-garbage-collector.html&lt;br /&gt;[2] Diab Abuaiadh, Yoav Ossia, Erez Petrank, and Uri Silbershtein. 
An efficient parallel heap compaction algorithm. In OOPSLA 2004.&lt;br /&gt;[3] Haim Kermany and Erez Petrank. The Compressor: concurrent, incremental, and parallel compaction. In PLDI 2006.&lt;br /&gt;[4] Michal Wegiel and Chandra Krintz. The Mapping Collector: Virtual Memory Support for Generational, Parallel, and Concurrent Compaction. In ASPLOS 2008.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-7438290427605555388?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/7438290427605555388/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=7438290427605555388' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/7438290427605555388'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/7438290427605555388'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2008/03/parallel-compacting-garbage-collectors.html' title='Parallel Compacting Garbage Collectors and their phases'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_4MytEQFYkbU/R-b7VACVjPI/AAAAAAAAAAs/Wk6OwlGXPf4/s72-c/image002.gif' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-900592501596028692</id><published>2008-02-29T07:05:00.000-08:00</published><updated>2008-03-02T02:27:14.457-08:00</updated><title type='text'>[Harmony GC Internal] SemiSpace garbage collector</title><content type='html'>Harmony implements a variant of the semi-space garbage collector in which the from-space and to-space sizes are dynamic, so as to get the best possible performance.&lt;br /&gt;&lt;br /&gt;The idea of the semi-space GC is to partition the space into two halves. Mutators always allocate in the from-space; when it is full, collectors copy the live objects from the from-space to the to-space, and then flip the roles of the two half-spaces.&lt;br /&gt;&lt;br /&gt;The advantage of the semi-space GC is that it is a fully copying GC, which is known to be efficient when the object survival ratio is low. There are basically two usage models for a semi-space GC: one is to avoid compacting collection, the other is to act as a generational collection.&lt;br /&gt;&lt;br /&gt;Avoiding compacting collection requires reserving a to-space for copying, but the to-space should be as small as possible to keep heap utilization high. We only need to reserve an adequate to-space and leave the rest of the space to the from-space.&lt;br /&gt;&lt;br /&gt;Leveraging the generational property requires copying the survivors from the from-space to the to-space. But we do not want to copy long-lived objects back and forth between the two half-spaces many times, so we should promote them out of the semi-space generation into a separate mature object space.&lt;br /&gt;&lt;br /&gt;These ideas led to the Harmony semi-space GC design. 
Below are illustrations of the heap layout and of a minor collection.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_4MytEQFYkbU/R8i-tJxrwwI/AAAAAAAAAAc/jeWgFpwTJTk/s1600-h/SemiSpace-GC-a.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://2.bp.blogspot.com/_4MytEQFYkbU/R8i-tJxrwwI/AAAAAAAAAAc/jeWgFpwTJTk/s320/SemiSpace-GC-a.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5172593854962713346" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_4MytEQFYkbU/R8i-6pxrwxI/AAAAAAAAAAk/7VyoB1-pa90/s1600-h/SemiSpace-GC-b.gif"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://4.bp.blogspot.com/_4MytEQFYkbU/R8i-6pxrwxI/AAAAAAAAAAk/7VyoB1-pa90/s320/SemiSpace-GC-b.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_5172594086890947346" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;NOS (the nursery object space) is always used for mutator allocation. Part of the from-space is occupied by the survivors of the last minor collection; the rest of the from-space is also used for mutator allocation. The to-space is empty, reserved for the next minor collection. In a minor collection, the survivors among the objects allocated since the last collection are copied to the to-space. The previous survivors in the from-space are promoted to MOS (the mature object space) once they are older than a certain age. In the current implementation, the age threshold is simply set to one, which has been good enough in our experiments. &lt;br /&gt;&lt;br /&gt;When the survivors are promoted to MOS, the boundary between MOS and NOS is shifted toward the NOS side so as to reserve some MOS space for the next round of survivor promotion. &lt;br /&gt;&lt;br /&gt;With careful design and tuning, we can give the MOS reserve space and the to-space just enough room to accommodate the copied objects. 
The question is what happens if that size turns out to be insufficient in some collection because the application's behavior changes. &lt;br /&gt;&lt;br /&gt;In Harmony GC, if the to-space is too small to hold all the new survivors, they can be copied directly to MOS, like the older survivors. If the MOS reserve space is too small to hold all the promoted survivors (old or new), a fallback mechanism is used: the minor collection switches to an entire-heap compacting collection, which compacts all the live objects in the heap to the low end of MOS.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-900592501596028692?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/900592501596028692/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=900592501596028692' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/900592501596028692'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/900592501596028692'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2008/02/harmony-gc-internal-semi-space-garbage.html' title='[Harmony GC Internal] SemiSpace garbage collector'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_4MytEQFYkbU/R8i-tJxrwwI/AAAAAAAAAAc/jeWgFpwTJTk/s72-c/SemiSpace-GC-a.gif' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-3206739999884189492</id><published>2008-02-06T23:28:00.000-08:00</published><updated>2008-02-23T01:19:31.976-08:00</updated><title type='text'>Linux内核管理风格 (Linux kernel management style)</title><content type='html'>（中文翻译：Xiao-Feng Li）&lt;br /&gt;&lt;br /&gt;这个短文档介绍了Linux内核所偏爱的（或“造作的”--看你问谁了）管理风格。目的是与文档《CodingStyle》的内容在某种程度上相呼应，主要是为了避免一遍又一遍地回答[1]同样（或类似）的问题。&lt;br /&gt;&lt;br /&gt;管理风格非常个人化，很难量化，比几条简单的关于编码风格的规定要难多了，因此这个文档与实际可能有联系，也可能没有联系。开始写时只是闹着玩的，不过也不能说它完全没用。这个只能由你自己决定了。&lt;br /&gt;&lt;br /&gt;顺便说一句，当提到“内核管理者”时，指的是技术带头人，而不是公司里的那些传统的管理人员。如果你在你们公司里要签单、或者对你们部门的预算有点儿概念的话，那你基本上可以肯定不是内核管理者。因此这里的建议可能不一定对你有用。&lt;br /&gt;&lt;br /&gt;首先，我建议你买一本《高度成功人士的七个习惯》，_不_要读它。烧掉它，这是一个很重要的象征性姿态。&lt;br /&gt;&lt;br /&gt;[1] 本文档并没有真地回答什么问题，而是要让提问题的人不得不明白：我们并不知道答案是什么。&lt;br /&gt;&lt;br /&gt;好吧，下面开讲：&lt;br /&gt;&lt;br /&gt;                第一章：决定&lt;br /&gt;&lt;br /&gt;大家都以为是管理者在做决定，而且以为做决定很重要。以为决定越大、越难，管理者就一定越高层。这看起来很深刻也很明显，其实不一定正确。&lt;br /&gt;&lt;br /&gt;问题的诀窍在于_避免_做决定。特别是当有人对你说“请选择(a)或(b)，我们真地需要你做出决定”时，你作为一个管理者就有麻烦了。因为你管理的人应该比你知道更多的细节，如果他们需要你来做出技术决定，你一定麻烦大了。显然你没有能力替他们做出决定。&lt;br /&gt;&lt;br /&gt;（推论：如果你的手下不比你了解更多的细节，你一样死定了，尽管死的原因与前面完全不同。这种情况下，看起来你不适合这个管理工作，应该是_他们_来管理你的才华才对。）&lt;br /&gt;&lt;br /&gt;因此问题的诀窍是_避免_做决定，至少避免做大的、难的决定。可以做一些小的、没有什么后果的决定，这会让你看起来象是知道你自己在做什么。所以内核管理者要做的就是把大的、难的决定转变成没人在意的小问题。&lt;br /&gt;&lt;br /&gt;需要明白的是，大决定和小决定之间的真正区别在于：你是否可以在后来修正你的决定。任何决定都可以变成一个小决定，只要你能确保下面这条：即在你可能错了的时候（你_一定_会错的），你总能回过头将其影响消除。这样，突然之间，你变得具有两倍的管理才能：你做出了_两个_没有后果的决定--一个错误的_和_一个正确的。&lt;br /&gt;&lt;br /&gt;而且人们甚至会认为这是真正的领袖素质（咳，狗屁，咳）。&lt;br /&gt;&lt;br /&gt;这样，避免做决定的关键就变成了如何避免做任何不能挽回的事情。不要被堵在一个无处逃生的死胡同。死胡同里的耗子会很危险--而一个死胡同里的管理者就太可怜了。&lt;br /&gt;&lt;br /&gt;实践证明，由于_在任何情况下_都没有人会蠢到真得让内核管理者去承担巨大的财政责任，因此Linux内核中的决定通常是很容易挽回的。既然你没有机会糟蹋掉一大笔你还不起的钱，你需要挽回的也就是一个技术决定而已；而对技术决定来说，挽回措施也相当容易：只需要告诉别人你是一个手生的笨蛋，说你很抱歉，然后去除那些你让大家在过去的一年里做的没用的东西。立刻，你一年前所做的决定就根本不是一个大决定了，因为它的影响可以被轻易地消除。&lt;br /&gt;&lt;br 
/&gt;实践证明有些人对这个方法不是很适应，由于下面的两个原因：&lt;br /&gt; - 承认自己是个白痴比想像的要难。我们都要维护自己的形象，到大庭广众之下说自己错了有时候的确非常困难。&lt;br /&gt; - 对那些可怜的底层工程师来说，要别人告诉他们去年做的东西完全没用会有些难度。尽管实际的_工作_本身可以简单地删除掉，你却可能不可挽回地失去那个工程师的信任。记住：“不可挽回”是我们首要竭力避免的，因此你的决定最终还是成了一个大决定。&lt;br /&gt;&lt;br /&gt;比较愉快的是，这两个问题的难度都可以被有效地降低：只要你实现承认自己不是特别明白；并在实施之前告诉大家，你的决定很初步，很有可能是错的。你应该一直保留改变主意的权利，并让大家非常_清楚_这一点。其实，在你还没有做蠢事之前承认自己愚蠢，要比做了之后要容易的多。&lt;br /&gt;&lt;br /&gt;这样，当最终被证明是件蠢事时，人们最多也就转转眼睛，说“唉，他又犯傻了”。&lt;br /&gt;&lt;br /&gt;这种事先承认没有把握的做法，也会让那些真正干活的人好好想想是否值得做这件事。千万记住，如果连_他们_都不能肯定，你一定不要用许诺的方法来鼓励他们，即承诺他们的工作会被加到Linux内核中。还是要让他们在启动一项大任务之前深思熟虑一下。&lt;br /&gt;&lt;br /&gt;记住：他们会比你知道更多的细节，而且他们通常认为他们对一切相关问题都已了然于胸。这个时候，管理者最好不要再去给他们吃定心丸了，而是给他们一剂用挑剔的眼光审视任务的苦口良药。&lt;br /&gt;&lt;br /&gt;顺便提一句，另一种避免做决定的办法是简单地嗔怪道：“我们不能两个都做吗？”并做出一副可怜相。相信我，这的确有效。如果还不清楚到底那个方法更好，他们最终一定会搞清楚的。很可能的结果是，两个小组都对结论感到失望并就此放弃。&lt;br /&gt;&lt;br /&gt;放弃听起来似乎是个失败，但这通常是一个信号，表明两个项目都有问题，而两个小组的成员都不能下决定的原因可能就是因为他们都错了。这样你最后倒成了香饽饽，而且你还避免了又一个可能让你死菜的决定。&lt;br /&gt;&lt;br /&gt;                第二章：人&lt;br /&gt;&lt;br /&gt;大多数人是白痴，做为管理者，你必须面对现实、和他们周旋。或许更重要的是，_他们_必须要和_你_周旋。&lt;br /&gt;&lt;br /&gt;事实证明，虽然消除技术错误很容易，消除人性的混乱却不容易。你必须与别人的个性共处--以及与你自己的个性共处。&lt;br /&gt;&lt;br /&gt;不过，要准备成为一个内核管理者，你最好记着不要断了自己的后路、殃及无辜、或与太多的内核开发者作对。事实证明，与人闹翻很容易，可再想和好就不容易了。因此“闹翻”立刻可以归类为“不可挽回”的行为；按照第一章，绝对应该避免。&lt;br /&gt;&lt;br /&gt;有几条简单的规则：&lt;br /&gt; (1) 不要称呼别人“蠢货” （至少不要在公开场合）&lt;br /&gt; (2) 当你忘了(1)时，就学会道歉&lt;br /&gt;&lt;br /&gt;(1)的问题是它太容易违反了。你可以有一万种不同的方式骂别人“蠢货”[2]，有时候你甚至都没有意识到，而且哪一次你都是不可抑制地义愤填膺。&lt;br /&gt;&lt;br /&gt;而且你越是相信你是对的（面对现实吧，你其实可以称呼_任何人_“蠢货”，而且你总_会_是对的），事后道歉对你来说会越难。&lt;br /&gt;&lt;br /&gt;要解决这个问题，你只有两个选择：&lt;br /&gt; - 成为道歉高手&lt;br /&gt; - 把你的“爱”均匀播洒，让大家觉得你的出手是公平的。你得有点儿创意，这样他们有时候可能还会觉得挺好玩儿的。&lt;br /&gt;&lt;br /&gt;希望靠着礼貌把一切摆平是不可能的。没有人会信任一个明显隐藏自己真实性情的人。&lt;br /&gt;&lt;br /&gt;[2]保罗西蒙过去唱《失去爱人方法五十种》，那是因为很明白，《告诉一个开发者是蠢货的方法一万种》听起来没有前者那么有韵味。不过我相信他考虑过后者。&lt;br /&gt;&lt;br /&gt;                第三章：人 II - 好的那类&lt;br /&gt;&lt;br 
/&gt;既然事实表明大多数人都是白痴，那结论自然很可悲：你也是一位。另外，即使我们都暗自觉得自己高于普通人（承认吧，没有人认为自己是普通人甚至不如普通人），我们也要承认我们不是那只最锋利的刀子，总有人没有你那么白痴。&lt;br /&gt;&lt;br /&gt;有些人容不下聪明人。有些人则会利用他们。&lt;br /&gt;&lt;br /&gt;作为一个内核的维护者，你要确保你属于第二种。你要紧随聪明人，因为他们能让你的工作变得容易。特别是，他们能替你做决定。这就是问题所在。&lt;br /&gt;&lt;br /&gt;因此，如果你发现有人比你聪明，那就顺水推舟吧。之后你的管理任务就变成诸如“听起来不错--去搞吧”，或者“这个听起来挺好，不过那个xxx怎么样？”。特别是第二种说法，你要么能跟着学点儿“xxx”的东西，要么能显示_额外_的管理才干，因为你指出了一些聪明人没有想过的东西。不管是哪种情况，你都是赢家。&lt;br /&gt;&lt;br /&gt;值得提出的是，你要明白，在一个领域的牛人在另一个领域就不一定了。因此尽管你可能在几个不同的方向上给与引导激励，需要承认的是，他们可能只在他们自己的领域是专家，而在其它领域一无是处。值得庆幸的是，人们会自然地象被引力吸引一样被拉回他们擅长的领域。因此，只要你不是激励的太厉害，那么即使你真地在其它方向上引导了他们，也不至于造成不可挽回的后果。&lt;br /&gt;&lt;br /&gt;                第四章：责任追究&lt;br /&gt;&lt;br /&gt;任何事情都回出错，因此人们需要找个人来承担责任。不好意思，你就是这个人。&lt;br /&gt;&lt;br /&gt;承担责任其实并不困难，特别是当人们明白并不全是你的错。这给我们提醒了一种最好的承担责任的方式：承担他人的责任。你会因为替人受过而感觉高尚，他也会因为免于责任而感觉不错。那个由于你的过错而丢失了36GB色情收藏的家伙也不得不承认，你至少没有试图逃避责任。&lt;br /&gt;&lt;br /&gt;然后你要_私下里_让那个真正把事情搞砸的人（如果你能找出他来）知道，是他把事情搞砸了。并不是说他因此就可以避免以后再出问题，而是因此他知道他欠你一个人情。而且，也许更重要的是，他很可能也是那个能修正问题的人；因为，我们得承认，那个人肯定不是你。&lt;br /&gt;&lt;br /&gt;承担责任也是你能成为管理者的首要原因。这是别人愿意信任你、并给你可能的荣誉的部分原因，因为正是你能站出来说“是我搞砸了”。如果你已经遵循了前面的规则，现在说这句话对你来说应该很容易了。&lt;br /&gt;&lt;br /&gt;                &lt;br /&gt;                第五章：需要避免的东西&lt;br /&gt;&lt;br /&gt;有一件事比骂人“蠢货”更让人不能接受，那就是用假惺惺的口气叫人“蠢货”。 前者还可以道歉，后者则没有机会。别人不会再听你的了，即使在你做的很漂亮的时候。&lt;br /&gt;&lt;br /&gt;我们每个人都认为自己比别人强，当别人在充大个时，我们会错误地觉得自己_真地_被看扁了。因此，也许你在道德或智力上的确高于他人，但不要太显摆了，除非你是有意要惹恼某个人[3]。&lt;br /&gt;&lt;br /&gt;同样，也不要太礼貌或太微妙了。礼貌很容易过度从而掩盖真正的问题，而且正如人们所说：“在互联网上，没人能听出来你的微妙”。因此，还是要大张旗鼓地摆明你的观点，不能寄望人们能揣摩出你的意思。&lt;br /&gt;&lt;br /&gt;加点儿幽默能中和你的鲁直和说教。如果你过分到显得荒谬，反倒可以达到你的目的，且不会让接受者感到不爽，他只会觉得你傻的可爱。因此幽默可以帮助克服人性的障碍，而这个障碍这是我们在对待批评时所共有的。&lt;br /&gt;&lt;br /&gt;[3]提示：互联网上那些与你的工作无直接关系的讨论组是发泄对他人不满的好地方。每隔一阵儿，带着嘲弄发些讨厌的帖子加入一场论战，会让你感到神清气爽。只是别把垃圾甩在自家门口了。&lt;br /&gt;&lt;br /&gt;                第六章：为什么是我？&lt;br /&gt;&lt;br /&gt;既然你的主要任务是替人受过，并且还要痛苦地让所有的人看出来你是个生手，那么一个显然的问题是：当初干嘛要干这个？&lt;br /&gt;&lt;br 
/&gt;本质上说，尽管可能不会有十来岁的小姑娘（或小男孩，这里我们不需要提出甄别或扮演性别主义者）尖叫着追到你的更衣室外敲门，你也_必将_由于是“负责人”而获得巨大的个人成就感。尽管你的所谓领导其实就是竭力追上别人并在其后尽力飞奔，你就不要太在意这些了，因为所有的人仍然会认为你就是负责人。&lt;br /&gt;&lt;br /&gt;如果你能把这个问题搞定，那将是一项伟大的成就。&lt;br /&gt;&lt;br /&gt;=======================================================================&lt;br /&gt;Original from http://lwn.net/Articles/105375/&lt;br /&gt;[Posted October 6, 2004 by corbet]&lt;br /&gt;&lt;br /&gt;                Linux kernel management style&lt;br /&gt;&lt;br /&gt;This is a short document describing the preferred (or made up, depending on who you ask) management style for the linux kernel.  It's meant to mirror the CodingStyle document to some degree, and mainly written to avoid answering (*) the same (or similar) questions over and over again. &lt;br /&gt;&lt;br /&gt;Management style is very personal and much harder to quantify than simple coding style rules, so this document may or may not have anything to do with reality.  It started as a lark, but that doesn't mean that it might not actually be true. You'll have to decide for yourself.&lt;br /&gt;&lt;br /&gt;Btw, when talking about "kernel manager", it's all about the technical lead persons, not the people who do traditional management inside companies.  If you sign purchase orders or you have any clue about the budget of your group, you're almost certainly not a kernel manager. These suggestions may or may not apply to you. &lt;br /&gt;&lt;br /&gt;First off, I'd suggest buying "Seven Habits of Highly Successful People", and NOT read it.  Burn it, it's a great symbolic gesture. &lt;br /&gt;&lt;br /&gt;(*) This document does so not so much by answering the question, but by making it painfully obvious to the questioner that we don't have a clue to what the answer is. &lt;br /&gt;&lt;br /&gt;Anyway, here goes:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;  Chapter 1: Decisions&lt;br /&gt;&lt;br /&gt;Everybody thinks managers make decisions, and that decision-making is important.  
The bigger and more painful the decision, the bigger the manager must be to make it.  That's very deep and obvious, but it's not actually true. &lt;br /&gt;&lt;br /&gt;The name of the game is to _avoid_ having to make a decision.  In particular, if somebody tells you "choose (a) or (b), we really need you to decide on this", you're in trouble as a manager.  The people you manage had better know the details better than you, so if they come to you for a technical decision, you're screwed.  You're clearly not competent to make that decision for them. &lt;br /&gt;&lt;br /&gt;(Corollary:if the people you manage don't know the details better than you, you're also screwed, although for a totally different reason. Namely that you are in the wrong job, and that _they_ should be managing your brilliance instead). &lt;br /&gt;&lt;br /&gt;So the name of the game is to _avoid_ decisions, at least the big and painful ones.  Making small and non-consequential decisions is fine, and makes you look like you know what you're doing, so what a kernel manager needs to do is to turn the big and painful ones into small things where nobody really cares. &lt;br /&gt;&lt;br /&gt;It helps to realize that the key difference between a big decision and a small one is whether you can fix your decision afterwards.  Any decision can be made small by just always making sure that if you were wrong (and you _will_ be wrong), you can always undo the damage later by backtracking.  Suddenly, you get to be doubly managerial for making _two_ inconsequential decisions - the wrong one _and_ the right one. &lt;br /&gt;&lt;br /&gt;And people will even see that as true leadership (*cough* bullshit *cough*).&lt;br /&gt;&lt;br /&gt;Thus the key to avoiding big decisions becomes to just avoiding to do things that can't be undone.  Don't get ushered into a corner from which you cannot escape.  A cornered rat may be dangerous - a cornered manager is just pitiful. 
&lt;br /&gt;&lt;br /&gt;It turns out that since nobody would be stupid enough to ever really let a kernel manager have huge fiscal responsibility _anyway_, it's usually fairly easy to backtrack.  Since you're not going to be able to waste huge amounts of money that you might not be able to repay, the only thing you can backtrack on is a technical decision, and there back-tracking is very easy: just tell everybody that you were an incompetent nincompoop, say you're sorry, and undo all the worthless work you had people work on for the last year.  Suddenly the decision you made a year ago wasn't a big decision after all, since it could be easily undone. &lt;br /&gt;&lt;br /&gt;It turns out that some people have trouble with this approach, for two reasons:&lt;br /&gt; - admitting you were an idiot is harder than it looks.  We all like to maintain appearances, and coming out in public to say that you were wrong is sometimes very hard indeed. &lt;br /&gt; - having somebody tell you that what you worked on for the last year wasn't worthwhile after all can be hard on the poor lowly engineers too, and while the actual _work_ was easy enough to undo by just deleting it, you may have irrevocably lost the trust of that engineer.  And remember: "irrevocable" was what we tried to avoid in the first place, and your decision ended up being a big one after all. &lt;br /&gt;&lt;br /&gt;Happily, both of these reasons can be mitigated effectively by just admitting up-front that you don't have a friggin' clue, and telling people ahead of the fact that your decision is purely preliminary, and might be the wrong thing.  You should always reserve the right to change your mind, and make people very _aware_ of that.  And it's much easier to admit that you are stupid when you haven't _yet_ done the really stupid thing.&lt;br /&gt;&lt;br /&gt;Then, when it really does turn out to be stupid, people just roll their eyes and say "Oops, he did it again".  
&lt;br /&gt;&lt;br /&gt;This preemptive admission of incompetence might also make the people who actually do the work also think twice about whether it's worth doing or not.  After all, if _they_ aren't certain whether it's a good idea, you sure as hell shouldn't encourage them by promising them that what they work on will be included.  Make them at least think twice before they embark on a big endeavor. &lt;br /&gt;&lt;br /&gt;Remember: they'd better know more about the details than you do, and they usually already think they have the answer to everything.  The best thing you can do as a manager is not to instill confidence, but rather a healthy dose of critical thinking on what they do. &lt;br /&gt;&lt;br /&gt;Btw, another way to avoid a decision is to plaintively just whine "can't we just do both?" and look pitiful.  Trust me, it works.  If it's not clear which approach is better, they'll eventually figure it out.  The answer may end up being that both teams get so frustrated by the situation that they just give up. &lt;br /&gt;&lt;br /&gt;That may sound like a failure, but it's usually a sign that there was something wrong with both projects, and the reason the people involved couldn't decide was that they were both wrong.  You end up coming up smelling like roses, and you avoided yet another decision that you could have screwed up on. &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;  Chapter 2: People&lt;br /&gt;&lt;br /&gt;Most people are idiots, and being a manager means you'll have to deal with it, and perhaps more importantly, that _they_ have to deal with _you_. &lt;br /&gt;&lt;br /&gt;It turns out that while it's easy to undo technical mistakes, it's not as easy to undo personality disorders.  You just have to live with theirs - and yours. &lt;br /&gt;&lt;br /&gt;However, in order to prepare yourself as a kernel manager, it's best to remember not to burn any bridges, bomb any innocent villagers, or alienate too many kernel developers. 
It turns out that alienating people is fairly easy, and un-alienating them is hard. Thus "alienating" immediately falls under the heading of "not reversible", and becomes a no-no according to Chapter 1.&lt;br /&gt;&lt;br /&gt;There's just a few simple rules here:&lt;br /&gt; (1) don't call people d*ckheads (at least not in public)&lt;br /&gt; (2) learn how to apologize when you forgot rule (1)&lt;br /&gt;&lt;br /&gt;The problem with #1 is that it's very easy to do, since you can say "you're a d*ckhead" in millions of different ways (*), sometimes without even realizing it, and almost always with a white-hot conviction that you are right. &lt;br /&gt;&lt;br /&gt;And the more convinced you are that you are right (and let's face it, you can call just about _anybody_ a d*ckhead, and you often _will_ be right), the harder it ends up being to apologize afterwards. &lt;br /&gt;&lt;br /&gt;To solve this problem, you really only have two options:&lt;br /&gt; - get really good at apologies&lt;br /&gt; - spread the "love" out so evenly that nobody really ends up feeling like they get unfairly targeted.  Make it inventive enough, and they might even be amused. &lt;br /&gt;&lt;br /&gt;The option of being unfailingly polite really doesn't exist. Nobody will trust somebody who is so clearly hiding his true character.&lt;br /&gt;&lt;br /&gt;(*) Paul Simon sang "Fifty Ways to Lose Your Lover", because quite frankly, "A Million Ways to Tell a Developer He Is a D*ckhead" doesn't scan nearly as well.  But I'm sure he thought about it. 
&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;  Chapter 3: People II - the Good Kind&lt;br /&gt;&lt;br /&gt;While it turns out that most people are idiots, the corollary to that is sadly that you are one too, and that while we can all bask in the secure knowledge that we're better than the average person (let's face it, nobody ever believes that they're average or below-average), we should also admit that we're not the sharpest knife around, and there will be other people that are less of an idiot that you are. &lt;br /&gt;&lt;br /&gt;Some people react badly to smart people.  Others take advantage of them. &lt;br /&gt;&lt;br /&gt;Make sure that you, as a kernel maintainer, are in the second group. Suck up to them, because they are the people who will make your job easier. In particular, they'll be able to make your decisions for you, which is what the game is all about.&lt;br /&gt;&lt;br /&gt;So when you find somebody smarter than you are, just coast along.  Your management responsibilities largely become ones of saying "Sounds like a good idea - go wild", or "That sounds good, but what about xxx?".  The second version in particular is a great way to either learn something new about "xxx" or seem _extra_ managerial by pointing out something the smarter person hadn't thought about.  In either case, you win.&lt;br /&gt;&lt;br /&gt;One thing to look out for is to realize that greatness in one area does not necessarily translate to other areas.  So you might prod people in specific directions, but let's face it, they might be good at what they do, and suck at everything else.  The good news is that people tend to naturally gravitate back to what they are good at, so it's not like you are doing something irreversible when you _do_ prod them in some&lt;br /&gt;direction, just don't push too hard.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;  Chapter 4: Placing blame&lt;br /&gt;&lt;br /&gt;Things will go wrong, and people want somebody to blame. 
Tag, you're it.&lt;br /&gt;&lt;br /&gt;It's not actually that hard to accept the blame, especially if people kind of realize that it wasn't _all_ your fault.  Which brings us to the best way of taking the blame: do it for another guy. You'll feel good for taking the fall, he'll feel good about not getting blamed, and the guy who lost his whole 36GB porn-collection because of your incompetence will grudgingly admit that you at least didn't try to weasel out of it.&lt;br /&gt;&lt;br /&gt;Then make the developer who really screwed up (if you can find him) know _in_private_ that he screwed up.  Not just so he can avoid it in the future, but so that he knows he owes you one.  And, perhaps even more importantly, he's also likely the person who can fix it.  Because, let's face it, it sure ain't you. &lt;br /&gt;&lt;br /&gt;Taking the blame is also why you get to be manager in the first place. It's part of what makes people trust you, and allow you the potential glory, because you're the one who gets to say "I screwed up".  And if you've followed the previous rules, you'll be pretty good at saying that by now. &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;  Chapter 5: Things to avoid&lt;br /&gt;&lt;br /&gt;There's one thing people hate even more than being called "d*ckhead", and that is being called a "d*ckhead" in a sanctimonious voice.  The first you can apologize for, the second one you won't really get the chance.  They likely will no longer be listening even if you otherwise do a good job. &lt;br /&gt;&lt;br /&gt;We all think we're better than anybody else, which means that when somebody else puts on airs, it _really_ rubs us the wrong way.  You may be morally and intellectually superior to everybody around you, but don't try to make it too obvious unless you really _intend_ to irritate somebody (*). &lt;br /&gt;&lt;br /&gt;Similarly, don't be too polite or subtle about things. 
Politeness easily ends up going overboard and hiding the problem, and as they say, "On the internet, nobody can hear you being subtle". Use a big blunt object to hammer the point in, because you can't really depend on people getting your point otherwise.&lt;br /&gt;&lt;br /&gt;Some humor can help pad both the bluntness and the moralizing.  Going overboard to the point of being ridiculous can drive a point home without making it painful to the recipient, who just thinks you're being silly.  It can thus help get through the personal mental block we all have about criticism. &lt;br /&gt;&lt;br /&gt;(*) Hint: internet newsgroups that are not directly related to your work are great ways to take out your frustrations at other people. Write insulting posts with a sneer just to get into a good flame every once in a while, and you'll feel cleansed. Just don't crap too close to home.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;  Chapter 6: Why me?&lt;br /&gt;&lt;br /&gt;Since your main responsibility seems to be to take the blame for other people's mistakes, and make it painfully obvious to everybody else that you're incompetent, the obvious question becomes one of why do it in the first place?&lt;br /&gt;&lt;br /&gt;First off, while you may or may not get screaming teenage girls (or boys, let's not be judgmental or sexist here) knocking on your dressing room door, you _will_ get an immense feeling of personal accomplishment for being "in charge".  Never mind the fact that you're really leading by trying to keep up with everybody else and running after them as fast as you can.  Everybody will still think you're the person in charge. 
&lt;br /&gt;&lt;br /&gt;It's a great job if you can hack it.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-3206739999884189492?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/3206739999884189492/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=3206739999884189492' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/3206739999884189492'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/3206739999884189492'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2008/02/linux-kernel-management-style-chinese.html' title='Linux内核管理风格 (Linux kernel management style)'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-5522308335255978921</id><published>2008-01-26T03:12:00.000-08:00</published><updated>2008-01-26T21:18:30.308-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><title type='text'>GC safe-point (or safepoint) and safe-region</title><content type='html'>&lt;b&gt;Root references&lt;/b&gt;&lt;br /&gt;Saying that an object is dead really means that it is useless. Only the programmer knows whether an object is useless or not. For the program itself to decide whether an object is useless, we can use compiler analysis, reference counting, or reachability analysis. 
&lt;br /&gt;&lt;br /&gt;Reachability analysis assumes an object is live as long as it is reachable by the mutator. If a reference to an object is held in a slot of the mutator's stack, the object is directly reachable. Objects reachable from reachable objects are also reachable. So the task of reachability analysis is to find the references that are directly reachable, which are the root references. The set of root references is the root set.&lt;br /&gt;&lt;br /&gt;The mutator's context holds the data that are directly reachable, so getting the root set means finding the object references in the context. The context of a mutator refers to its stack and its register file (and some other thread-specific data). Global data are also directly reachable.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Root set enumeration&lt;/b&gt;&lt;br /&gt;Normally, if a GC uses reachability to determine an object's liveness, it needs a consistent snapshot of the mutator's context in order to enumerate the root references. This is true for both stop-the-world (STW) and concurrent GC (mostly). "Consistent" means the snapshot looks as if it were taken at a single point in time. A consistent snapshot of the root references is necessary for correctness; otherwise some live objects might be lost. The question then is how to get a consistent snapshot of the mutator's context.&lt;br /&gt;&lt;br /&gt;A simple way to get a consistent snapshot is for the mutator to suspend its execution during root reference enumeration. The snapshot is also consistent if the root set does not change during the enumeration process.  &lt;br /&gt;&lt;br /&gt;When a mutator suspends its execution, it is not necessarily able to enumerate the root references in its context unless it book-keeps the reference information for that context. That is, it should be able to tell which stack slots and which registers hold references. If the GC can get this information accurately, the process is called precise root set enumeration; otherwise it is imprecise. 
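&lt;br /&gt;&lt;br /&gt;As a rough illustration of reachability analysis from a root set, here is a minimal Python sketch. It is not Harmony code: the heap is modeled as a plain dictionary, and the names reachable, heap and root_set are made up for this example.&lt;br /&gt;&lt;pre&gt;
```python
from collections import deque


def reachable(roots, edges):
    """Return the set of object ids reachable from the root references.

    roots: object ids held directly in the (hypothetical) mutator context,
           i.e. stack slots, registers and global data.
    edges: dict mapping an object id to the ids it references.
    """
    live = set(roots)
    worklist = deque(live)
    while worklist:
        obj = worklist.popleft()
        for ref in edges.get(obj, ()):
            if ref not in live:
                live.add(ref)          # reachable from a reachable object
                worklist.append(ref)
    return live


# Toy heap: "d" references "a", but nothing references "d".
heap = {"a": ["b"], "b": ["c"], "c": [], "d": ["a"]}
root_set = ["a"]
print(sorted(reachable(root_set, heap)))  # ['a', 'b', 'c']
```
&lt;/pre&gt;In this toy heap the object "d" is dead because no root reference leads to it. In a real VM the root set would come from enumerating the mutator's stack, registers and global data rather than from a hand-written list.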
&lt;br /&gt;&lt;br /&gt;(For imprecise enumeration, the GC has to use heuristics to conservatively guess the references in the context. Such a GC is called a conservative GC. This essay only discusses precise enumeration.)&lt;br /&gt;&lt;br /&gt;Harmony supports precise root set enumeration with GC safe-points and safe-regions.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Safe-point (or safepoint)&lt;/b&gt;&lt;br /&gt;To support precise enumeration, the JIT compiler has to do additional work, because only the JIT knows the exact stack frame layout and register contents. When the JIT compiles a method, it can book-keep, for every instruction, the root reference information needed in case execution is suspended at that instruction. &lt;br /&gt;&lt;br /&gt;But remembering this information for every instruction is too expensive: it requires substantial space. It is also unnecessary, because only a few instructions have a chance of being suspension points in real execution. The JIT only needs to book-keep information for those instruction points, which are called safe-points. A safe-point is a point where it is safe to suspend for root set enumeration.&lt;br /&gt;&lt;br /&gt;Btw, a compiler's ability to know exactly what each stack slot holds is not available in all programming languages. Only safe languages have it. For example, C/C++ doesn't.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Mutator suspension&lt;/b&gt;&lt;br /&gt;The question with safe-points is how we can guarantee that the mutator is suspended at a safe-point. &lt;br /&gt;&lt;br /&gt;There are basically two approaches to suspending a mutator: preemptive and voluntary. The preemptive approach suspends the mutator whenever the GC needs to start a collection. If the GC finds the mutator suspended at an unsafe point, it resumes the mutator and rolls it forward to a safe-point. This was implemented in ORP [1], the predecessor of Harmony. 
But currently almost no JVM takes this approach.&lt;br /&gt;&lt;br /&gt;The approach used in Harmony is voluntary suspension. When the GC wants to trigger a collection, it simply sets a flag; the mutators poll the flag periodically and suspend once they find it set. Those polling points are safe-points. It's mostly the JIT's responsibility to insert the polls at proper positions. Sometimes the VM also needs some polling points.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Polling point&lt;/b&gt;&lt;br /&gt;So where are the right places to poll for the GC trigger event? As discussed above, we do not want a polling point at every instruction. For voluntary suspension, the more serious problem is the polling overhead. So the basic principles for polling point insertion are: Firstly, polling points should be frequent enough that the GC does not wait too long for a mutator to suspend, because other mutators might be waiting for the GC to free space before they can continue. Secondly, polling points should not be so frequent that they introduce a big runtime overhead. &lt;br /&gt;&lt;br /&gt;The best result is to have only the polling points that are necessary and sufficient. &lt;br /&gt;1. The mandatory polling points are the allocation sites. Allocation can trigger a collection, so an allocation site has to be a safe-point. &lt;br /&gt;2. Long-running execution is always associated with method calls or loops. So call sites and loop back sites are also expected polling points. &lt;br /&gt;&lt;br /&gt;Those are the polling point sites in Harmony: allocation sites, call sites and loop back sites. The runtime overhead is mostly smaller than 1%. Unfortunately, we found that safe-points alone are not sufficient. &lt;br /&gt;&lt;br /&gt;&lt;b&gt;Safe-region&lt;/b&gt;&lt;br /&gt;Why are safe-points alone not sufficient? The reason is that we forgot one case of long-running execution. We forgot it because it's actually not long-running execution, but a long idle period. 
There are situations when the application cannot respond promptly to a GC trigger event, such as when it is sleeping or blocked in a system call. These operations are out of the JVM's control, and the JVM cannot respond to a GC trigger event in that period. So we introduce the safe-region to solve the problem.&lt;br /&gt;&lt;br /&gt;A safe-region is a section of code in which no references are mutated, so it is safe to enumerate roots at any point in that region. In other words, a safe-region is a big extended safe-point. &lt;br /&gt;&lt;br /&gt;In the safe-point design, a mutator polling for the GC event responds if the event has been triggered. It responds by setting a ready flag once it is certain to suspend. Then the GC can proceed with root set enumeration. This is a hand-shaking protocol. &lt;br /&gt;&lt;br /&gt;The safe-region just follows this protocol. The mutator sets the ready flag when it enters a safe-region. Before it leaves the region, it checks whether the GC has finished its enumeration (or collection) and no longer needs the mutator to stay suspended. 
If so, it goes ahead and leaves the region; otherwise, it suspends itself as at a safe-point.&lt;br /&gt;&lt;br /&gt;In the Harmony implementation, we insert suspend_enable and suspend_disable to delimit the scope of a safe-region.&lt;br /&gt;&lt;br /&gt;[1] ORP (Open Runtime Platform), http://orp.sf.net&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-5522308335255978921?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/5522308335255978921/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=5522308335255978921' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/5522308335255978921'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/5522308335255978921'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2008/01/gc-safe-point-and-safe-region.html' title='GC safe-point (or safepoint) and safe-region'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-6281458984626197035</id><published>2008-01-20T01:05:00.000-08:00</published><updated>2008-01-26T20:21:25.291-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='Apache Harmony'/><title type='text'>[Harmony GC Internal] Design principles of Harmony 
GC</title><content type='html'>The Harmony GC code base was designed with the following principles:&lt;ol&gt;&lt;br /&gt;&lt;li&gt;&lt;b&gt;Performance&lt;/b&gt;. This is mainly achieved by parallelizing the important phases with innovative algorithms. Scalability across multiple cores should be good as well.&lt;br /&gt;&lt;li&gt;&lt;b&gt;Modularity&lt;/b&gt;. Subcomponents of the GC are well modularized, so that a modification in one component does not require changes in another. At the same time, an improvement in one component can easily be shared by other components.&lt;br /&gt;&lt;li&gt;&lt;b&gt;Flexibility&lt;/b&gt;. It should be easy to add a new collection algorithm or to experiment with different ideas.&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;I personally think these principles have largely been achieved. &lt;br /&gt;&lt;p&gt;&lt;b&gt;Performance&lt;/b&gt;&lt;br /&gt;All of the collection algorithms are parallelized with well-tuned load-balancing mechanisms. We developed three different parallelization algorithms, for marking/forwarding, compaction, and sweeping. They are suitable for different situations.  &lt;br /&gt;&lt;br /&gt;For marking/forwarding parallelization, I experimented with three different load-balancing mechanisms: work-stealing, task-pushing[1] and pool-sharing. Currently we use pool-sharing in the code base for its simplicity.&lt;br /&gt;&lt;p&gt;&lt;b&gt;Modularity&lt;/b&gt;&lt;br /&gt;Harmony GC has multiple collection algorithms. Each algorithm manages one type of space. The heap consists of multiple spaces, so a complete GC is composed by combining multiple collection algorithms. This is natural, and the different collection algorithms share a large body of common code.&lt;br /&gt;&lt;br /&gt;Harmony GC abstracts the thread entities into collectors and mutators, both of which are allocators. The abstraction is important for modularity. A space is allocated through an allocator, be it a mutator or collector. 
The space can be used for mutator allocation as a nursery object space (NOS), or for collector allocation as a mature object space (MOS), while its allocation algorithm stays the same.&lt;br /&gt;&lt;p&gt;&lt;b&gt;Flexibility&lt;/b&gt;&lt;br /&gt;It is easy to extend Harmony GC with new collection algorithms, largely due to its modular design. It now has trace-forward, semi-space, slide-compact, move-compact and mark-sweep collection algorithms and their generational variants. More algorithms are under development. The implementations of those algorithms have proved Harmony GC's flexibility.&lt;br /&gt;&lt;br /&gt;Harmony GC can dynamically switch between several collection algorithms and between generational and non-generational modes. This is made possible by the flexible design.&lt;br /&gt;&lt;br /&gt;From a software engineering point of view, many people would agree that it's hard to have performance, modularity and flexibility all together. Then how can we achieve them simultaneously? There are a couple of techniques worth mentioning.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;GC helper inlining&lt;/b&gt;&lt;br /&gt;The GC module is built into a dynamic shared object binary (Windows DLL or ELF so) exporting a set of well-defined VM/GC interfaces. Symbols in a dynamic shared object can only be accessed indirectly, which leads to the runtime overhead of indirect function calls and parameter marshaling.&lt;br /&gt;&lt;br /&gt;We overcame the boundary-crossing problem with GC helpers. A GC helper is a GC method written in Java that can be compiled by the JIT as normal Java code. We wrote several important GC functions, such as allocation, hash code and write barrier, as GC helpers, which solved the cross-boundary performance issue. &lt;br /&gt;&lt;br /&gt;There are some details about the GC helpers. &lt;br /&gt;&lt;br /&gt;Firstly, GC functions need to manipulate objects and native pointers, but Java doesn't support that. 
We introduced JIT intrinsics that look like normal Java code but are actually just symbols known to the JIT. &lt;br /&gt;&lt;br /&gt;Secondly, some GC functions have big bodies or even trigger a collection. It's unrealistic and unnecessary to write them in Java, or we would have to write the whole GC in Java. The solution is to write only the fast path of the GC function in its Java helper, and leave the slow path to the original native implementation in the shared object.&lt;br /&gt;&lt;br /&gt;Thirdly, the major benefits of GC helpers come from inlining them into jitted code. Inlining has many advantages (as compiler people know), such as enabling constant folding and propagation of parameters, code scheduling and register allocation.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Boundary adjustment&lt;/b&gt;&lt;br /&gt;Harmony GC abstracts spaces out of heap management, composing different GCs by combining multiple spaces. This could cause heap under-utilization if the space sizes were statically partitioned.&lt;br /&gt;&lt;br /&gt;To achieve the best heap utilization, Harmony GC permits the boundaries between different spaces to be adjusted at runtime. We developed sophisticated heuristics to decide when to adjust a boundary and by how much. If the boundary is between two equal-level allocation spaces, it's adjusted according to the allocation rates of the spaces. 
If the boundary is between an allocation space and a collection space, it's adjusted according to the surviving ratio of the allocation space.&lt;br /&gt;&lt;br /&gt;Dynamic switching between collection algorithms and generational/non-generational modes will be discussed later.&lt;br /&gt;[1] Task-pushing (http://people.apache.org/~xli/papers/ipdps07-task-pushing.pdf)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-6281458984626197035?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/6281458984626197035/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=6281458984626197035' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/6281458984626197035'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/6281458984626197035'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2008/01/harmony-gc-internal-design-principles.html' title='[Harmony GC Internal] Design principles of Harmony GC'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-8162329044148166513</id><published>2008-01-18T21:20:00.001-08:00</published><updated>2008-01-20T01:09:59.100-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='Apache 
Harmony'/><title type='text'>[Harmony GC Internal] Mark bit design</title><content type='html'>First of all, in most cases, with the exception of the Mark-Sweep algorithm, we decided to put the MARK_BIT in the object header to indicate that an object is live. &lt;br /&gt;&lt;br /&gt;We used an extra mark_bit_table in which one bit maps to one word. The problem with the mark_bit_table is in parallel marking. If two live objects are close together, they can map to bits in the same byte. When two threads mark them in parallel, they will modify bits in the same byte at the same time. Since a byte is the minimal memory access unit, their modifications race unless we use atomic operations or a “byte to word” mapping. But atomic operations are too expensive, and “byte to word” wastes too much memory. So we chose to put the mark bit in the object header. &lt;br /&gt;&lt;br /&gt;A mark bit in the object header doesn’t need atomic operations. It’s possible that two threads simultaneously try to mark the same object. That’s ok, because they will write the same value to the header, so there is no correctness issue. They might not know about each other, and both continue to scan the slots of the marked object. This leads to only minimal redundant work, because the two threads are unlikely to stay in lockstep while marking the whole object reachability tree. &lt;br /&gt;&lt;br /&gt;One issue with the mark bit in the object header is that we have to iterate over the heap to find all the live objects. And if we want to clear the mark bits of live objects, we need a heap pass as well. With a mark_bit_table, we find the corresponding live objects by reading the table, and we can clear the table with memset. Normally a GC that uses a copying/forwarding algorithm has no finding or clearing operations, so the mark bit in the header is better. (For Harmony GCv5, which uses compaction for major collection, the idea is that major collection happens rarely.)&lt;br /&gt;&lt;br /&gt;Ok, now let’s go to the design of the mark bit in GCv5. 
I will try to describe the rationale behind it as much as possible, but I might skip some details, either unintentionally or because they require a long historical retrospective. Below is the MARK_BIT_FLIPPING design for non-gen minor collection. &lt;br /&gt;&lt;ol&gt;&lt;br /&gt;&lt;li&gt;During minor collection, we trace and forward live objects in NOS. The oi of the original copy of a forwarded object has the FORWARD_BIT set to indicate that it is forwarded, so that the same object is forwarded only once;&lt;br /&gt;&lt;li&gt;Since it’s nongen, the non-NOS objects are traced as well, but not forwarded. To avoid tracing the same object multiple times, we put a MARK_BIT in its oi;&lt;br /&gt;&lt;li&gt;When another object references an object, we need to check whether the referenced object is marked or forwarded. If it’s in NOS, it’s checked for forwarding; otherwise for marking;&lt;br /&gt;&lt;li&gt;After minor collection, NOS has only dead objects or those with the FORWARD_BIT. They are useless and are cleaned. Non-LOS has live objects and dead objects. Some of the live objects were there before the collection and now have the MARK_BIT. Some were forwarded from NOS. We don’t want to clear those MARK_BITs, because the overhead of that operation can be big for a minor collection, even with a mark_bit_table;&lt;br /&gt;&lt;li&gt;In the next collection, we will trace both NOS and non-NOS again. But the live objects in non-LOS already have the MARK_BIT. When we meet them, we would think they are already traced and skip them. So to deal with this case, we need to mark them with another bit. That is the FLIP_MARK_BIT. FLIP_MARK_BIT uses the two LSB bits alternately from one collection to the next. &lt;br /&gt;&lt;li&gt;The two LSB bits serve as both MARK_BIT and FORWARD_BIT. When one bit is the MARK_BIT, the other is the FORWARD_BIT. They are flipped automatically, so you can use the names directly.&lt;br /&gt;&lt;li&gt;When we mark a live object in non-NOS, we clear its original bit, whatever it is, so that it has only the current MARK_BIT. 
If the FORWARD_BIT (actually the previous collection's MARK_BIT) is not cleared, next time we would mistake it for the current MARK_BIT.&lt;br /&gt;&lt;/ol&gt;&lt;br /&gt;Two situations should be considered:&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Slide-compaction uses the oi field for the target address, where it also puts the FORWARD_BIT to indicate that the object has a new address. It doesn’t care which bit the last minor collection left there. But it should clear the FORWARD_BIT before the end of the collection, because the bit would be treated as the MARK_BIT in the next minor collection. &lt;br /&gt;&lt;li&gt;In a fallback major collection, some NOS live objects have the FORWARD_BIT and some non-NOS live objects have the MARK_BIT. Some live objects in NOS and non-NOS have no flag set. The major collection needs to mark all the live objects in a separate phase. It can’t use oi any more because those bits are already there. We use the vt LSB as the major collection MARK_BIT. And we clear the two LSB bits in oi for live objects.&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;Some details about the partial-forward and semi-space GCs: &lt;ul&gt;&lt;br /&gt;&lt;li&gt;Partial-Forward. This is one of the collection algorithms for NOS in Harmony GC. It leaves the newborns in NOS and only copies the rest to MOS. The assumption is that newborns are the objects most likely to still be live, so leaving them uncopied can probably save collection overhead. &lt;br /&gt;&lt;br /&gt;In partial-forward collection, the old copy of a forwarded object has the FORWARD_BIT set, and the live objects left uncopied can have the MARK_BIT set, so that a reference to a NOS object can easily tell whether it needs updating. The problem is that in the next NOS minor collection, after the bits flip, those uncopied objects automatically carry the FORWARD_BIT. This makes it hard to know whether a FORWARD_BIT was newly set in this collection or is a legacy of the last collection. &lt;br /&gt;&lt;br /&gt;We use a simple mechanism to tell the difference. 
We introduce a third bit, AGE_BIT, in the object header for those uncopied objects. We set the AGE_BIT in this collection for the uncopied objects and clear it in the next collection, because by then the object is sure to be forwarded.&lt;br /&gt;&lt;br /&gt;So we use pointer comparison (whether the object is in the non-forward region), the AGE_BIT and the mark bits (MARK_BIT and FORWARD_BIT) together to deal with the partial-forward case.&lt;br /&gt;&lt;br /&gt;&lt;li&gt; Semi-Space. We also implemented semi-space collection in Harmony GC for NOS collection. In a semi-space minor collection, most live objects are copied to to-space; some aged ones are copied to MOS. Since all of them are moved, we use the FORWARD_BIT for all the live objects in NOS. &lt;br /&gt;&lt;br /&gt;We use the AGE_BIT to indicate that an object is in to-space; it is set when the object is forwarded to to-space. So far there is no pointer comparison to decide whether an object is in from-space or to-space. &lt;br /&gt;&lt;br /&gt;The situation becomes a little trickier in generational semi-space collection, depending on the remembered set implementation. If the rem-set is implemented with slot-remembering, it's highly likely that many slots are remembered repeatedly. &lt;br /&gt;&lt;br /&gt;During the collection, a slot can hold a reference to the new copy of a forwarded object in this collection's to-space. This object has no FORWARD_BIT, but has the AGE_BIT set. Such objects could be confused with the yet-to-be-forwarded objects in the last collection's to-space. The GC needs a way to figure out that the former are new survivors that have already been forwarded and should not be forwarded again. &lt;br /&gt;&lt;br /&gt;In my design, the GC simply does a pointer comparison to determine whether the object is in the to-space of this collection. 
If so, the GC simply ignores it, without even remembering the slot in the collector's rem-set, because it must have been remembered when the object was forwarded.&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;The AGE_BIT is introduced to reduce pointer comparisons, since a logical bit operation plus a comparison with zero is cheaper than two pointer comparisons. A flipping AGE_BIT might reduce pointer comparisons further, but flipping requires the AGE_BIT to be a global memory variable, whose operations might not be much cheaper than pointer comparisons.&lt;br /&gt;&lt;br /&gt;The same holds for the flipping MARK_BIT. We could use pointer comparisons to determine whether an object is in MOS, rather than a complementary bit pattern. There is no definitive best design, just trade-offs.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-8162329044148166513?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/8162329044148166513/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=8162329044148166513' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/8162329044148166513'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/8162329044148166513'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2008/01/harmony-gc-internal-mark-bit-design.html' title='[Harmony GC Internal] Mark bit design'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-270655697009331236</id><published>2007-12-23T00:02:00.000-08:00</published><updated>2008-01-20T01:09:59.100-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='Apache Harmony'/><title type='text'>[Harmony GC Internal] Fallback compaction</title><content type='html'>Fallback compaction is a special kind of major collection. It’s triggered during a minor collection when there is not enough space to hold the forwarded objects. This is possible because we might have reserved inadequate free space in MOS for NOS forwarding. One important note is that, although the reserved space is not enough, the total heap size is still enough for all live objects (i.e., no OOME), because all the live objects were already happily in the heap before the collection. &lt;br /&gt;&lt;br /&gt;We have to use compaction for fallback, because there is no free space available for copying. The complexity of fallback is that the forwarded objects have two copies: the original one in NOS, and the new one in the MOS reserved space. But we can’t simply remove either copy. The new copy is important, because the original copy’s oi field (the obj_info field in the object header for meta info) now holds a forwarding pointer, and the original oi info can only be found in the new copy. The original copy is also important, because some other objects might still reference the old objects, and they can only find the new copy through the forwarding pointer in the original copy. &lt;br /&gt;&lt;br /&gt;We could restore the original copies of the forwarded objects and remove the new copies. This would involve restoring oi in NOS and restoring all the repointed pointers (in the whole heap and in the rootset). 
That is not worth the big overhead if we can compact with both copies in place. Below we examine the mark-compact phases with the LISP2 compactor as an example.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Phase 0:&lt;/b&gt; marking.  The fallback marking process has an extra operation: repointing all the references to forwarded objects to their new locations in MOS. When we trace a slot that references a forwarded object in NOS, we need to find its new location and update the slot with the new value, then continue marking and scanning the new copy. After the marking phase, no reference in the heap or rootset points to the original copy of a forwarded object. Those original copies are useless and are actually dead objects now. This approach is better than an extra phase that restores the original copies. &lt;br /&gt;&lt;br /&gt;One important note: when we scan an original copy, we don’t check whether it has been marked, but whether it is in NOS and has been forwarded. Even if it has been scanned, we still need to find its new location to update the slot pointing to it. So this object never has the MARK_BIT set in vt, while its oi has the FORWARD_BIT set. &lt;br /&gt;&lt;br /&gt;(I will discuss the mark bits in detail in a later blog entry. But I'd put an important note about mark bit flipping here: the MARK_BIT for normal major collection is not flipped because it is not used at all in compaction. More importantly, if the live objects’ MARK_BIT is untouched by a major collection in some space, say LOS, the bit pattern remains as it was after the last minor collection. The pattern will be correct for the next minor collection, when the MARK_BIT is flipped. For fallback compaction, the MARK_BIT is the same as in the minor collection that triggers the fallback, without flipping. So the FORWARD_BIT in _NOS_ is still semantically the forward bit during compaction.) &lt;br /&gt;&lt;br /&gt;&lt;b&gt;Phase 1:&lt;/b&gt; relocating. 
This phase computes the new address of every live object and puts that address (also called the forwarding pointer) in oi, with FORWARD_BIT set. The original oi is saved with two LSBs cleared. We find the live objects by iterating over the heap in order, looking for those with vt marked. The original copy of a forwarded object is not marked (as said above), so it’s skipped. &lt;br /&gt;&lt;br /&gt;There is one issue to deal with for fallback. When we iterate over the heap for the next live object, we compute the next object's location from this object's address plus its size. Because of hashcode support, the object might have an extra field for an attached hashcode. Whether it does is decided by the hash bits in oi, which requires the oi info to be sane. But the oi of the original copy of a forwarded object is a forwarding pointer, which may mislead the size computation and hence cause incorrect identification of the next object. We need a way to figure out the correct hash status. &lt;br /&gt;&lt;br /&gt;(For example, in a simple NOS forwarding GC, the objects in NOS have no extra hash fields, so GC knows this by checking whether the object is in NOS. Another way is to set a bit in vt during forwarding to indicate whether the object has a hashcode attached. The bit should be different from the mark bit in vt, to avoid conflicting with the mark bit in the marking phase. We can use an additional phase before Phase 1 to restore the hash bit pattern in the forwarded objects by checking the hash bit in vt; or we can always check the hash bit in vt for object size computation. Since the forwarded objects are dead objects, it doesn't matter if a hash bit stays in vt.) &lt;br /&gt;&lt;br /&gt;Once the target addresses are computed, the following phases are almost the same as in a normal major collection.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Phase 2:&lt;/b&gt; repointing. &lt;br /&gt;&lt;b&gt;Phase 3:&lt;/b&gt; moving. Since fallback means the reserved free space is inadequate for NOS forwarding, it is possible that MOS size is inadequate as well. 
In this case, some live objects have to stay in NOS after moving. We call them "MOS overflowed objects". There are a few solutions for this problem. &lt;ol&gt;&lt;br /&gt;&lt;li&gt;We can adjust the NOS/MOS boundary dynamically when NOS_BOUNDARY is not mapped statically. In this case, it's quite simple.&lt;br /&gt;&lt;li&gt;If NOS_BOUNDARY can't be adjusted, we move those NOS live objects to MOS with an extra phase called EXTEND_COLLECTION. For each NOS block in turn, we unmap an emptied block in NOS, map a block in MOS, and move the live objects from the NOS block to it. In this way, the total mapped memory size is kept under control. Keeping the mapped memory size within budget is the reason why we do not simply map more blocks after the relocating phase. After relocating, although we know the block count remaining in NOS, we can't map the same count of blocks in MOS before those in NOS are unmapped. Those NOS blocks can only be unmapped after the moving phase is done. &lt;br /&gt;&lt;li&gt;Another solution is to keep those live objects in NOS, and deal with them like other new-born live objects in NOS in the next collection. This is the solution I least want to use, but it's useful if only a couple of blocks overflow.&lt;br /&gt;&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;That is pretty much the fallback compaction design in Harmony GC. 
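&lt;br /&gt;&lt;br /&gt;To make the Phase 0 repointing described earlier more concrete, here is a minimal Python sketch of marking with slot repointing. It is a toy model, not Harmony code: the Obj class, its fields (in_nos, forward, marked, slots) and the function fallback_mark are all hypothetical names, with forward standing in for the forwarding pointer kept in oi and marked standing in for MARK_BIT in vt.

```python
class Obj:
    """Toy heap object; every field name here is hypothetical."""

    def __init__(self, name, in_nos=False):
        self.name = name
        self.in_nos = in_nos    # True if the object lives in NOS
        self.forward = None     # new copy in MOS if forwarded (oi with FORWARD_BIT)
        self.marked = False     # stands in for MARK_BIT in vt
        self.slots = []         # outgoing references


def fallback_mark(roots):
    """Mark from the rootset, repointing slots that reference forwarded NOS objects."""
    stack = []

    def visit(holder, i):
        target = holder[i]
        # Check "in NOS and forwarded" first, not the mark bit: the slot
        # must be updated even if the new copy was already marked.
        if target.in_nos and target.forward is not None:
            target = target.forward
            holder[i] = target
        if not target.marked:
            target.marked = True
            stack.append(target)

    for i in range(len(roots)):
        visit(roots, i)
    while stack:
        obj = stack.pop()
        for i in range(len(obj.slots)):
            visit(obj.slots, i)


# Usage: "old" was forwarded to "new" before the fallback was triggered.
old = Obj("old", in_nos=True)
new = Obj("new")
old.forward = new
a = Obj("a")
a.slots = [old]
roots = [a, old]
fallback_mark(roots)
print(roots[1].name, a.slots[0].name, old.marked, new.marked)
```

After the call, every slot that used to reference the original copy references the new copy instead, and the original copy is never marked, so a later heap walk over marked objects skips it.&lt;br /&gt;&lt;br /&gt;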
One last note: all the phases are conducted in parallel.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-270655697009331236?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/270655697009331236/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=270655697009331236' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/270655697009331236'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/270655697009331236'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/12/harmony-gc-internal-fallback-compaction.html' title='[Harmony GC Internal] Fallback compaction'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-7988247365784252066</id><published>2007-12-22T22:10:00.000-08:00</published><updated>2008-01-20T01:09:59.101-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='Apache Harmony'/><title type='text'>[Harmony GC Internal] BUILD_IN_REFERENT and IGNORE_FINREF</title><content type='html'>Harmony GC has full support for weak references and finalizers. But sometimes it's easier for developers to disable that support, so that they can focus on the algorithm proper; it also helps with debugging. 
Harmony GC provides two options to control the finref support status. One is a macro, BUILD_IN_REFERENT; the other is a command line option, IGNORE_FINREF.&lt;br /&gt;&lt;br /&gt;BUILD_IN_REFERENT means weakrefs are processed as normal objects. It can be defined or not without impacting correctness. If it's defined, GC will keep the weakly-reachable objects as normal live objects (i.e., strongly reachable). In this situation, the finalizers of those weakly-reachable objects cannot be executed, because the objects are not reclaimed. Some tests may depend on finalizer execution to pass, and hence might not pass with BUILD_IN_REFERENT defined. (But it's arguable whether a test is correctly designed if it depends on timely finalizer processing.)&lt;br /&gt;&lt;br /&gt;On the other hand, IGNORE_FINREF means GC will not pass finalizable objects and nullified references to the VM for processing, i.e., the ref's enqueue method and the fin's finalizer method are not executed. To specify IGNORE_FINREF, use -XX:gc.ignore_finref=true on the command line.&lt;br /&gt;&lt;br /&gt;So, there are the following combinations of the two options:&lt;ol&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt;If BUILD_IN_REFERENT &amp;&amp; IGNORE_FINREF, there is no finref concept at all. &lt;br /&gt;&lt;br /&gt;&lt;li&gt;If BUILD_IN_REFERENT &amp;&amp; ! IGNORE_FINREF, the case is incorrect and should never happen. If BUILD_IN_REFERENT is defined, we set IGNORE_FINREF in gc_init().&lt;br /&gt;&lt;br /&gt;&lt;li&gt;If ! BUILD_IN_REFERENT &amp;&amp; IGNORE_FINREF, weakrefs are processed in GC, but finrefs are not passed to the VM for processing.&lt;br /&gt;&lt;br /&gt;&lt;li&gt;If ! BUILD_IN_REFERENT &amp;&amp; ! IGNORE_FINREF, all the processing is conducted according to the spec. &lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;Case 1 is commonly used when developing a new GC algorithm. 
We can just define BUILD_IN_REFERENT in &lt;gc&gt;\src\finalizer_weakref\finalizer_weakref.h, and IGNORE_FINREF is then set automatically, as described in Case 2.&lt;br /&gt;&lt;br /&gt;Case 3 is usually used when developing the finref processing subsystem. &lt;br /&gt;&lt;br /&gt;Case 4 is the default setting.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-7988247365784252066?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/7988247365784252066/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=7988247365784252066' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/7988247365784252066'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/7988247365784252066'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/12/harmony-gc-doc-buildinreferent-and.html' title='[Harmony GC Internal] BUILD_IN_REFERENT and IGNORE_FINREF'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-1616569773195465402</id><published>2007-11-25T21:50:00.000-08:00</published><updated>2007-12-03T18:27:15.269-08:00</updated><title type='text'>One unified runtime for all languages?</title><content type='html'>Tamarin is &lt;a href="http://ejohn.org/blog/the-browser-scripting-revolution/"&gt;going to support CIL&lt;/a&gt;[1] with project 
IronMonkey. That would let the Tamarin virtual machine support a broad range of languages. The Tamarin team &lt;a href="http://ejohn.org/blog/why-tamarin-instead-of"&gt;argued &lt;/a&gt; [2] why they didn't choose other runtime systems for the purpose, specifically, JVM or Mono. To summarize, the technical reasons are mainly download size, memory footprint, slow startup and language affinity. Do these really make much sense? I think the answer is yes and no for JVM. &lt;br /&gt;&lt;br /&gt;Although the runtime download size is a problem when people run a Java applet in their browser, it may not be a problem if the JVM already exists on the desktop, unless the JVM needs to be updated from time to time. But download size does matter when Java applications are bundled with a JRE in their distributions. Ideally, JVM should be part of the platform, as the OS is today; then there would be no need to download it frequently. Unfortunately, SUN hasn't run this model successfully; people today regard JVM as separate software. One of the reasons is Microsoft, which tries to make .NET the default runtime for Windows, while Linux did not have a mature Java for years.   &lt;br /&gt;&lt;br /&gt;Another impression people have of Java is that it is a heavy-weight memory eater. I guess people got the impression from early implementations of Java SE, and I guess this will not be a serious issue with the availability of cheaper RAM. Still, people perceive Java as heavy-weight because of its slow startup. For small Java applications, startup is sometimes terribly slow compared with their native counterparts. Slow startup is a serious issue for the desktop, but it has not really been resolved up to now. I believe its negative impact on Java's adoption has been large. 
Fat guys usually move slowly, so people feel that Java is too fat, although slow startup and heavy weight are actually two different issues.&lt;br /&gt;&lt;br /&gt;Finally, JVM was defined solely for Java, at least at the beginning. The Tamarin guys don't believe JVM can behave equally well for other languages, specifically ECMAScript. I personally think the affinity between JVM and Java may not be a serious issue if the download size and slow startup problems are finally solved. With that said, although the JVM definition may not be a real roadblock for other languages, a JVM implementation can be. My experience with JVM implementation is that it can be biased toward Java semantics in various subtle ways.&lt;br /&gt;&lt;br /&gt;Mono has issues very similar to JVM's. Parrot might not have these issues, but it is not yet done for Perl6. My question to Tamarin is how it can avoid all these issues when it supports CIL.&lt;br /&gt;&lt;br /&gt;I guess Tamarin would work around many of the problems by targeting only Ruby and Python. Supporting dynamic languages is different from supporting static ones like Java and C#. Static languages welcome heavy-weight optimizations, which are not necessarily useful for dynamic languages. Some of the problems with JVM are caused purely by the heavy-weight optimizations. I once wrote a JVM with a simple JIT and GC that had very small code size and memory footprint. But the GC had no parallel, generational, or concurrent collection, and the JIT had almost no serious optimizations. &lt;br /&gt;&lt;br /&gt;In any case, I believe one can't get all the benefits without swallowing the disadvantages at the same time. For example, on the server side, heavy-weight optimizations are really desirable, while the client side cares more about response time. JVM today is more successful on the server side, while ECMAScript succeeds only on the client side. A single runtime suitable for all the different usages sounds too difficult. Even Java alone has different editions.  
&lt;br /&gt;&lt;br /&gt;Btw, Microsoft supports a unified runtime project based on CLR, called DLR [3], and SUN supports another unified runtime project based on JVM, called "JVM Languages" [4]. &lt;br /&gt; &lt;br /&gt;[1] http://ejohn.org/blog/the-browser-scripting-revolution/&lt;br /&gt;[2] http://ejohn.org/blog/why-tamarin-instead-of&lt;br /&gt;[3] http://blogs.msdn.com/hugunin/archive/2007/04/30/a-dynamic-language-runtime-dlr.aspx&lt;br /&gt;[4]http://groups.google.com/group/jvm-languages&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-1616569773195465402?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/1616569773195465402/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=1616569773195465402' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/1616569773195465402'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/1616569773195465402'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/11/one-unified-runtime-for-all-languages.html' title='One unified runtime for all languages?'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-2309161243636329740</id><published>2007-11-22T16:09:00.000-08:00</published><updated>2008-01-20T01:09:59.101-08:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='Apache Harmony'/><title type='text'>Why Apache Harmony?</title><content type='html'>My last blog entry discussed whether &lt;a href="http://xiao-feng.blogspot.com/2007/11/will-apache-harmony-succeed.html"&gt;Apache Harmony will succeed&lt;/a&gt; [1], but I first need to argue why people really need Apache Harmony.&lt;br /&gt;&lt;br /&gt;Here is the motivation given by Geir Magnusson, who was the project lead.&lt;br /&gt;&lt;blockquote&gt;There is a clear need for an open-source version of Java 2, Standard Edition (J2SE) runtime platform, and there are many ongoing efforts to produce solutions (Kaffe, Classpath, etc). There are also efforts that provide alternative approaches to execution of Java bytecode (GCJ and IKVM). All of these efforts provide a diversity of solutions, which is healthy, but barriers exist which prevent these efforts from reaching a greater potential.&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;But it doesn't say which real problems an open source Java implementation would solve. An article from two and a half years ago summarized the key problems that Apache Harmony could probably solve: &lt;a href="http://dynamicsemantics.blog-city.com/whyharmony.htm"&gt;Clear problems that having Apache Harmony will solve&lt;/a&gt;[2] by Floyd Marinescu. I repeat them below: &lt;i&gt;&lt;ol&gt;&lt;br /&gt;&lt;li&gt;Betting your infrastructure on technology from one vendor (Sun) who could one day stop it's offerings, either by going out of business, or changing priorities. Solution: An open source implementation can survive if independently if Sun dissapears one day or if suddenly the big vendors decide that there is more money to be made on some other platform, like Ruby. :) &lt;br /&gt;&lt;li&gt;Java can't come shipped/supported by some Linux - Many free flavours of Linux won't distribute Java due to Java not being open source. 
As a result, software developed for these distributions will not be done in Java. Solution: An open source Java would be adopted by the Linux community, increasing the value proposition of the "write once, run anywhere" mantra, and causing a lot more development to be done in Java.&lt;br /&gt;&lt;li&gt;Limited traction for desktop Java apps due in part to the linux problem - The Brazilians want Java to be part of the core, so they can have all kinds of desktop applications written in Java.&lt;br /&gt;&lt;li&gt;You can't ship a custom JVM with your application. There are a number of reasons why one might want to do this. You might need JVM’s with custom features, or ship apps with parts of JVM's embedded (would you prefer a 2meg exe file distributable that combines a mini-JRE and your app, or 100KB JAR that needs the entire JRE?), or implement remote updating for JVM's (like Windows Update), or simply making it easier to install an app that depends on Java.&lt;br /&gt;&lt;li&gt;Java is not available on all platforms that could benefit from it. Custom hardware solutions like hand held devices, kiosks, and other stuff that can't install a typical OS don't have Java implementations because no vendors will make enough profit to justify the effort and licensing fees. Solution: An open source Java means that corporate or government entities can implement custom JVM's that will run on those systems (think J2ME) and thus be able to leverage skill sets of existing java developers and investments in Java internally and for their customers.&lt;br /&gt;&lt;li&gt;The US goverment might embargo our country and cut "us" off. This, especially for emerging economies is a critical point. Countries like Brazil, China, India that want to use Java heavily, at a national scale are making themselves dependent on US corporations if they base their infrastructure on Java. 
If the US embargo's their country, they will lose rights to upgrade or get on support on Java-based tools from US corporations, possibly Java itself (I'm not on top of all the legal here but this is what I've heard). The US has a long history of intervention in many parts of the world, especially South America, so this is a real and present problem. Solution: An open source Java means that a US embargo would not have any effect on a foreign country's ability to continue upgrading and getting support for Java and derivative tools.&lt;/ol&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;I don't believe in some of the problems listed above, such as the "US embargo" stuff. But I do agree with one of them, i.e., the customized JVM. As far as I know, there are companies that are unsatisfied with SUN Java but can't customize it for their products. Here I want to give some bullets of my own besides the customization point.&lt;br /&gt;&lt;br /&gt;1. To have a Java that is really in the public domain. Java is a programming language, but so far it is almost under SUN's sole control, even with OpenJDK. Yes, the language is in the public domain, but in practice it is not if there is no implementation in the public domain. Apache Harmony is the solution. This is important not only for some countries' strategy (which I said I don't believe in), but even more for the public, so they can do whatever they want with Java. The first bullet in the list above says something similar.&lt;br /&gt;&lt;br /&gt;2. To be a runtime technology vehicle. There are lots of runtime systems, but Java is the most mature (possibly along with .NET). The technologies developed in Apache Harmony can be applied or reused in other runtime systems, and people who want runtime support can use Apache Harmony. For example, more and more dynamic languages have implementations on top of Java, in order to leverage the maturity of Java. 
The other example is Google Android, which is not Java but reuses Apache Harmony.&lt;br /&gt;&lt;br /&gt;Then why not OpenJDK? License really matters here. If JCP is really open, and if OpenJDK is really free, I believe OpenJDK will be preferred to Harmony. But it's not the case (yet). Note, TCK doesn't really matter in my points.&lt;br /&gt;&lt;br /&gt;[1] http://xiao-feng.blogspot.com/2007/11/will-apache-harmony-succeed.html&lt;br /&gt;[2] http://dynamicsemantics.blog-city.com/whyharmony.htm&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-2309161243636329740?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/2309161243636329740/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=2309161243636329740' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/2309161243636329740'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/2309161243636329740'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/11/why-apache-harmony.html' title='Why Apache Harmony?'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-7247887510720670103</id><published>2007-11-22T05:46:00.000-08:00</published><updated>2008-01-20T01:09:59.101-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Apache 
Harmony'/><title type='text'>Will Apache Harmony succeed?</title><content type='html'>There was a &lt;a href="http://www.javalobby.org/java/forums/t18646.html"&gt;survey&lt;/a&gt; [1] two and half years ago when Apache Harmony was started, to get people's comments on Apache Harmony's fate. Reading through it, I felt more pessimistic or negative opinions than optimistic or positive ones. Many people believed Harmony was either useless or going to fail, although they had different reasons.&lt;br /&gt;&lt;br /&gt;About nine months ago, an article &lt;a href="http://www.crn.com/article/printableArticle.jhtml?articleId=197003131"&gt;"How To Tell The Open-Source Winners From The Losers"&lt;/a&gt; [2] tried to summarize why an open source project could fail. Those are nine points to check:&lt;ol&gt;&lt;br /&gt;&lt;li&gt;A thriving community: A handful of lead developers, a large body of contributors, and a substantial--or at least motivated--user group offering ideas.  &lt;br /&gt;&lt;li&gt;Disruptive goals: Does something notably better than commercial code. Free isn't enough.  &lt;br /&gt;&lt;li&gt;A benevolent dictator: Leader who can inspire and guide developers, asking the right questions and letting only the right code in.&lt;br /&gt;&lt;li&gt;Transparency: Decisions are made openly, with threads of discussion, active mailing list, and negative and positive comments aired.  &lt;br /&gt;&lt;li&gt;Civility: Strong forums police against personal attacks or niggling issues, focus on big goals.  &lt;br /&gt;&lt;li&gt;Documentation: What good's a project that can't be implemented by those outside its development?&lt;br /&gt;&lt;li&gt;Employed developers: The key developers need to work on it full time.  &lt;br /&gt;&lt;li&gt;A clear license: Some are very business friendly, others clear as mud.  &lt;br /&gt;&lt;li&gt;Commercial support: Companies need more than e-mail support from volunteers. Is there a solid company employing people you can call? 
&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;Using this checklist to measure Harmony, though Harmony has good scores for most of the points, Charles doubted "what passionate user community will form around Harmony when open Java is available on the Net?" &lt;br /&gt;&lt;br /&gt;I have to say Charles has made very valid points largely for open source projects, but I can't agree that Harmony is losing developers due to OpenJDK. I won't elaborate my arguments, just one point here: Harmony is not necessarily existing only as an alternative Java implementation. So Harmony is not necessarily losing its developers, because they are not just looking for an alternative Java implementation. For this specific point, I have a couple of examples:&lt;ul&gt;&lt;br /&gt;&lt;li&gt; Google Android uses Apache Harmony for its class libraries; &lt;br /&gt;&lt;li&gt; People are porting Harmony GC(s) to other runtime systems;&lt;br /&gt;&lt;li&gt; Some Java applications do not care if Harmony is Java certified, using Harmony as default runtime environment.&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;Let's see how Apache Harmony is going to evolve. It's still too young (less than three years old). 
Stay tuned.&lt;br /&gt;&lt;br /&gt;[1] http://www.javalobby.org/java/forums/t18646.html&lt;br /&gt;[2] http://www.crn.com/article/printableArticle.jhtml?articleId=197003131&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-7247887510720670103?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/7247887510720670103/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=7247887510720670103' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/7247887510720670103'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/7247887510720670103'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/11/will-apache-harmony-succeed.html' title='Will Apache Harmony succeed?'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-7316066387294451081</id><published>2007-11-21T04:54:00.000-08:00</published><updated>2008-01-20T01:08:59.599-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><title type='text'>Idea to encode object header in 64-bit even in a 64-bit environment</title><content type='html'>An object header usually has two machine words: one for the vtable pointer, the other for object metadata such as lock support, hash info, etc. 
On a 64-bit platform, we still want to use two 32-bit words for the object header. &lt;br /&gt;&lt;br /&gt;This is easy to achieve if the heap size is no bigger than 4GB. In this case, it's virtually still a 32-bit platform. There is no problem encoding the pointers into 32-bit values. It's the compressed reference idea I've talked about in previous blog entries [1] (and [2]). The question is whether we can use two 32-bit words with unlimited heap size on a 64-bit platform.&lt;br /&gt;&lt;br /&gt;My answer is yes. The vtable pointer can be encoded as before, since all the vtables in the system can be put into a small region whose size is far smaller than 4GB. We can use the offset of a vtable in the region to represent the vtable, and compute its address by adding the region base address. Object metadata needs no more than 32 bits, so it's not a problem either. The real problem is that, during garbage collection, the object header words are usually used to store the forwarding pointer. We must guarantee that the forwarding pointer is stored correctly.&lt;br /&gt;&lt;br /&gt;In a copying GC, the original object copy is kept when its new copy is allocated in another place. That means the header words in the original copy are basically useless. We can combine the two header words and put a 64-bit forwarding pointer there. The situation is a little more complicated in a compacting GC. &lt;br /&gt;&lt;br /&gt;In a compacting GC, the objects are packed to one end of the heap, overwriting the original copies, such as in a LISP2 compactor. We can't let a 64-bit forwarding pointer use the two header words during compaction, because the vtable word is still needed and must not be overwritten before the final object moving phase. &lt;br /&gt;&lt;br /&gt;But I found that compacting GC has a good property: it compacts a contiguous source region into another contiguous target region. 
If the target region is smaller than 4GB, we can use its base address for forwarding pointer compression, shown below.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_4MytEQFYkbU/R0Q0JNNZ8qI/AAAAAAAAAAM/FvP_2LCP8p8/s1600-h/4G.bmp"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://4.bp.blogspot.com/_4MytEQFYkbU/R0Q0JNNZ8qI/AAAAAAAAAAM/FvP_2LCP8p8/s320/4G.bmp" border="0" alt=""id="BLOGGER_PHOTO_ID_5135286807878562466" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;If the target region is larger than 4GB, we partition it into 4GB segments and keep their base addresses for offset computation. We also partition the heap into multiple source segments so that each source segment is compacted into one target segment. We keep a mapping table between the source segment address ranges and the target segment base addresses. For every object reference, we can then compute its corresponding offset in its target segment.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_4MytEQFYkbU/R0Q0edNZ8rI/AAAAAAAAAAU/jA0jZHAezJ8/s1600-h/8G.bmp"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://1.bp.blogspot.com/_4MytEQFYkbU/R0Q0edNZ8rI/AAAAAAAAAAU/jA0jZHAezJ8/s320/8G.bmp" border="0" alt=""id="BLOGGER_PHOTO_ID_5135287172950782642" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;So far the idea above is enough for Harmony GC to support a 64-bit object header.&lt;br /&gt;&lt;br /&gt;[1]http://xiao-feng.blogspot.com/2007/05/harmony-gcv5-64-bit-support_03.html&lt;br /&gt;[2]http://xiao-feng.blogspot.com/2007/10/more-notes-on-concept-of-managednull.html&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-7316066387294451081?l=xiao-feng.blogspot.com' alt='' 
/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/7316066387294451081/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=7316066387294451081' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/7316066387294451081'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/7316066387294451081'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/11/idea-to-encode-object-header-in-64-bit.html' title='Idea to encode object header in 64-bit even in a 64-bit environment'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_4MytEQFYkbU/R0Q0JNNZ8qI/AAAAAAAAAAM/FvP_2LCP8p8/s72-c/4G.bmp' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-4320071847354225206</id><published>2007-11-21T04:51:00.000-08:00</published><updated>2008-01-20T01:10:37.089-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><title type='text'>A better bitmap design for mark-sweep GC</title><content type='html'>Mark-sweep GC usually uses bitmap in chunk header or elsewhere to record some metadata for the objects in the chunk. The bitmap is basically used for object marking bit, showing the objects are live or not in the chunk. With bitmap, GC can find live objects without touching them, by scanning the bitmap. For example, one bit for one object. 
The object is live if the bit is set. If more metadata is needed, the bit count can increase accordingly. We have an interesting innovation in bitmap design.&lt;br /&gt;&lt;br /&gt;Currently, all the mark-sweep GCs known to me use a simple mapping from bitmap to objects. For example, one bit is used for one word, assuming objects are allocated aligned at a word boundary. This works because an object can be allocated virtually anywhere, so any bit in the bitmap can potentially map to an object's first word. If objects are always aligned at an 8-byte boundary, one bit can represent 8 bytes. If more metadata is needed, say 4 bits, then 4 bits correspond to 8 bytes. The bitmap/object size ratio is 4/(8*8), which is about 6%. Certainly we want this ratio to be as small as possible.&lt;br /&gt;&lt;br /&gt;Sometimes, for performance reasons, we may sacrifice some space for time. For example, a modern GC usually employs multiple collector threads for collection work. The marking process is also parallelized, i.e., multiple collectors mark the bits in the bitmap simultaneously. If the bits for two objects can be in the same byte, atomic operations are necessary for multiple threads to manipulate the byte correctly. Assuming objects are larger than 12 bytes (vtable, object info, and object fields), one object can map to one byte. By arranging the mapping well, we can guarantee that two objects never map to bits in the same byte. This saves the atomic operations for marking, which are very expensive.&lt;br /&gt;&lt;br /&gt;So far there is nothing new. Next comes our innovation. In a modern mark-sweep GC, whether generational, concurrent or otherwise, pre-sized chunks are often used to reduce heap fragmentation and improve thread-local allocation speed. That is, each chunk has a specified entry size so that only objects of that size are allocated in it. 
In this case, we can use one bit for the whole object, rather than one bit for a certain fixed number of bytes. If the average object size in pre-sized chunks is 40 bytes, the bitmap/object size ratio is 1/(40*8), a negligible value of about 0.3%. &lt;br /&gt;&lt;br /&gt;But this alone is not much of a gain in performance, and may even be a loss. The reasons are:&lt;br /&gt;1. The mapping operation from an object to its bit in the bitmap is expensive. It is roughly computed as BitPos = (ObjAddr - ChunkAddr)/EntrySize. The division is very expensive, and its cost might be similar to that of an atomic operation.&lt;br /&gt;2. One bit per object has a problem in parallel marking, when multiple collectors contend on the same byte for different bits.&lt;br /&gt;&lt;br /&gt;We can improve the method on both issues.&lt;br /&gt;1. Division operation. If the divisor is a power of 2, we can use a bit-shift for the division. So the improvement idea is to choose a power of 2 as the divisor. We choose the largest power of 2 that is not larger than the entry size of the chunk. In this way, some bits in the bitmap have no objects to map to. It's a negligible space waste (the waste is smaller than the original bitmap size because of how the divisor is chosen), but it is a big saving in execution time by eliminating the division operation.&lt;br /&gt;2. Atomic operation. If this problem is serious in some workloads, we can use one byte for one object (entry). If the average object size is 40 bytes, the bitmap/object size ratio is 1/40, only 2.5%, still much smaller than the traditional solution.&lt;br /&gt;&lt;br /&gt;We prototyped this technique in Apache Harmony GC. 
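&lt;br /&gt;&lt;br /&gt;As a rough illustration of improvement 1 above, here is a minimal Python sketch of the shift-based mapping. The names shift_for and bit_pos, the 40-byte entry size and the chunk address are made up for this example and are not taken from the Harmony code base.&lt;br /&gt;&lt;pre&gt;

```python
def shift_for(entry_size):
    # Shift count for the largest power of two that does not exceed entry_size.
    return entry_size.bit_length() - 1


def bit_pos(obj_addr, chunk_addr, shift):
    # Plays the role of (obj_addr - chunk_addr) / entry_size, using a shift.
    return (obj_addr - chunk_addr) >> shift


ENTRY_SIZE = 40          # hypothetical entry size of one pre-sized chunk
CHUNK_ADDR = 0x10000     # hypothetical chunk base address

shift = shift_for(ENTRY_SIZE)   # 5, i.e. the divisor is 32
positions = [bit_pos(CHUNK_ADDR + k * ENTRY_SIZE, CHUNK_ADDR, shift)
             for k in range(5)]
print(shift, positions)         # prints: 5 [0, 1, 2, 3, 5]
```

&lt;/pre&gt;With a 40-byte entry size the divisor becomes 32, so bit 4 is never used in this run. That is the small space waste mentioned above, traded for a shift instead of a division.&lt;br /&gt;&lt;br /&gt;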
There were some problems to implement the idea directly, so we didn't put it into the SVN code base.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-4320071847354225206?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/4320071847354225206/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=4320071847354225206' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/4320071847354225206'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/4320071847354225206'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/11/better-bitmap-design-for-mark-sweep-gc.html' title='A better bitmap design for mark-sweep GC'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-5306809969892175735</id><published>2007-11-12T15:47:00.000-08:00</published><updated>2008-01-20T01:09:59.102-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Apache Harmony'/><title type='text'>Google Android SDK published with Apache Harmony used</title><content type='html'>Google just published its Android SDK. Some modules developed in Apache Harmony are used in Android Java stack. I just downloaded the SDK from http://code.google.com/android/ . 
&lt;br /&gt;&lt;br /&gt;Here is a quick list of the modules used:&lt;br /&gt;&lt;br /&gt;annotation&lt;br /&gt;archive&lt;br /&gt;auth&lt;br /&gt;crypto&lt;br /&gt;instrument&lt;br /&gt;kernel&lt;br /&gt;logging&lt;br /&gt;luni&lt;br /&gt;math&lt;br /&gt;misc&lt;br /&gt;nio&lt;br /&gt;niochar&lt;br /&gt;prefs&lt;br /&gt;security&lt;br /&gt;sound&lt;br /&gt;sql&lt;br /&gt;text&lt;br /&gt;xnet&lt;br /&gt;&lt;br /&gt;It's a big deal for Apache Harmony. I'd like to get the Harmony DRLVM modules into Android as well. Let's see.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-5306809969892175735?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/5306809969892175735/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=5306809969892175735' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/5306809969892175735'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/5306809969892175735'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/11/google-andriod-sdk-published-with.html' title='Google Android SDK published with Apache Harmony used'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-7141768032300667194</id><published>2007-10-25T18:14:00.000-07:00</published><updated>2008-01-20T01:09:59.102-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Apache Harmony'/><title type='text'>More notes on the concept of managed_null</title><content type='html'>Salikh gave some good comments on my last blog entry [1], where I suggested reconsidering the concept of managed_null introduced by the compressed reference technique. In summary, Salikh thinks the concept of managed_null provides a handy tool for VM developers. But in my development experience I found it more confusing than handy. &lt;br /&gt;&lt;br /&gt;In my opinion, the only true concern about removing managed_null is performance. The compressed ref technique was invented for performance improvement, so we surely want to keep managed_null if its removal causes performance degradation. When the application accesses a compressed ref, it does the following:&lt;br /&gt;   NativeRef = CompressedRef + HeapBase;&lt;br /&gt;&lt;br /&gt;If we have managed_null (whose value is HeapBase), then the NativeRef for a NULL (0) CompressedRef is:&lt;br /&gt;   NativeRef = NULL + HeapBase;  =&gt;   NativeRef = managed_null;&lt;br /&gt;That is, a NULL compressed ref is still a conceptual NULL (managed_null).&lt;br /&gt;Since a dereference always requires null pointer checking, when we memory-protect the position of managed_null (heap_base), its dereference leads to a trap just like the dereference of a real 0-valued reference. &lt;br /&gt;&lt;br /&gt;If we don't want to use the idea of "managed_null is also a NULL (conceptually)", we need to keep the same NULL value for both CompressedRef and NativeRef. Then the decompression computation becomes:&lt;br /&gt;   NativeRef = (CompressedRef == NULL)? 
NULL: (CompressedRef + HeapBase);&lt;br /&gt;This adds some overhead for the comparison and branch. Mikhail Fursov and Pavel Pervov have come up with a simple idea to solve the problem by using a conditional move instruction, so this is no longer a problem.&lt;br /&gt;&lt;br /&gt;I have been thinking more about the issue: when can a compressed ref value be NULL? &lt;br /&gt;References are set to NULL when an object is initialized, when a field is set to NULL explicitly by the code, or when a ref field is assigned from an input/return parameter. If the ref is managed_null, it can be written back with a direct subtraction (ref - HeapBase). Otherwise, we need another comparison:&lt;br /&gt;   CompressedRef = (NativeRef==NULL)? NULL: (NativeRef - HeapBase). &lt;br /&gt;This is runtime overhead in compression, which we should avoid as well.&lt;br /&gt;&lt;br /&gt;But then the question is why we would bother to do the additional comparison in compression and decompression. My answer is as follows:&lt;br /&gt;&lt;br /&gt;There is no need to do those comparisons in most cases. My suggestion to remove managed_null doesn't mean we can't use HeapBase temporarily at runtime. We can still use   "NativeRef = CompressedRef + HeapBase" to decompress a ref, and we can still mem-protect the HeapBase position to catch null pointers. We only need to convert a compressed NULL to an uncompressed native NULL in a register when there is explicit null-checking or the register value is passed to other methods. E.g., when it's passed to a native method, a NULL value is obviously more convenient than a managed_null as its input parameter.&lt;br /&gt;&lt;br /&gt;In any case, my personal suggestion was to remove the concept of managed_null (i.e., a value that can be NULL or heap_base in different modes) in JVM development, and to use NULL and heap_base with clear distinction. It’s only a convention issue. JIT can still choose whatever technique gives the best performance. 
As long as the conventions (calling across methods, native and Java, runtime helper) are agreed between the JIT and other components. That is, to mem-protect heap_base can be an optimization, but we are not forced to regard heap_base as another NULL. &lt;br /&gt;&lt;br /&gt;[1]http://xiao-feng.blogspot.com/2007/10/reconsider-managednull-concept-in.html&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-7141768032300667194?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/7141768032300667194/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=7141768032300667194' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/7141768032300667194'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/7141768032300667194'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/10/more-notes-on-concept-of-managednull.html' title='More notes on the concept of managed_null'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-4381631953360799879</id><published>2007-10-23T18:08:00.000-07:00</published><updated>2008-01-20T01:09:59.102-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Apache Harmony'/><title type='text'>Reconsider the managed_null concept in Harmony DRLVM</title><content type='html'>DRLVM 
has two types of NULL references: one is the plain NULL (physically 0), the other is managed_null. The original idea of managed_null is to make the Java NULL reference opaque to developers, so that they don't need to know the exact value of managed_null, only that it represents the Java NULL reference. It can be 0, or whatever appropriate value, or a variable at runtime.&lt;br /&gt;&lt;br /&gt;The idea has no problem and is actually interesting. Normally managed_null is just 0, so there is no real difference between managed_null and NULL. The difference is only conceptual. But once compressed references are introduced, there is a real difference in DRLVM.&lt;br /&gt;&lt;br /&gt;Compressed reference is a useful technique for 64-bit JVM development, where a native pointer has 64 bits. Since the Java heap size is mostly below 4GB, a Java reference can be represented with a 32-bit value. A straightforward way is to store, for each Java reference in an object, its 32-bit offset from the heap base address. When the program wants to dereference it, the 32-bit value is added to the heap base address and the result is used as the real native pointer. &lt;br /&gt;&lt;br /&gt;Experiments showed that the compressed reference technique significantly reduces memory footprint, since lots of object fields are references, which can then be encoded in 32-bit values. Although it introduces some runtime overhead to add/subtract the heap base address, the performance benefit is obvious due to the virtually enlarged heap size and the reduced cache misses with more tightly packed objects. Currently almost all commercial JVM implementations use the compressed ref technique on 64-bit platforms.&lt;br /&gt;&lt;br /&gt;No object is allocated below the heap base address, so an address there, such as 0, can be regarded as the value for the NULL reference, with or without the compressed ref technique. But for some reason, DRLVM chooses to set managed_null to be the heap base address in compressed ref mode. 
I can't recall the reason now. But I guess it's actually due to some confusion about the value of the NULL reference. &lt;br /&gt;&lt;br /&gt;With the compressed ref technique, the heap base address has offset 0, so it gives people the impression that it can be the opaque representation of the NULL reference. But I have to say this is a wrong impression, because it actually contradicts the original idea behind managed_null's introduction. In this design, people give managed_null the value of the heap base address exactly because they KNOW its offset (compressed ref) is 0, but the original idea of managed_null is to make its value opaque. And for every dereference, the compressed ref value needs to be added to managed_null. So managed_null exists exactly for VALUE computations! This is obviously wrong conceptually.&lt;br /&gt;&lt;br /&gt;It also leads to some problems:&lt;br /&gt;1. Null-pointer-exception (NPE) handling. If managed_null is physically 0, the VM can utilize platform support for NPE handling, because a dereference of value 0 always causes a trap in current OS designs (with HW support). But a dereference of a managed_null equal to the heap base address does not necessarily lead to a trap, unless the page is memory R/W protected. This is inconvenient. More importantly, an unset reference field in an object is commonly initialized to 0 (together with the other fields) when the object is allocated. It's inconvenient to set the reference fields to the heap base address instead. So the current implementation still uses 0 for the NULL reference. This gives the VM two conceptual NULLs: the real native NULL and managed_null.&lt;br /&gt;&lt;br /&gt;2. No matter whether the object reference fields are initialized to NULL or to managed_null, there is still a problem. The heap might be composed of multiple segments, each 4GB in size. Then each has a separate heap base address. That means we will have multiple different values for managed_null. It's really cumbersome. 
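&lt;br /&gt;&lt;br /&gt;To make problem 1 concrete, here is a small Python sketch of the plain offset scheme described earlier (compressed ref = native address minus heap base). HEAP_BASE, compress and decompress are hypothetical names with a made-up address; this is an illustration, not DRLVM code.&lt;br /&gt;&lt;pre&gt;

```python
HEAP_BASE = 0x7F0000000000   # hypothetical heap base address


def compress(native_ref):
    # Store only the offset from the heap base.
    return native_ref - HEAP_BASE


def decompress(compressed_ref):
    # Rebuild the native pointer by adding the heap base back.
    return compressed_ref + HEAP_BASE


obj = HEAP_BASE + 0x1000                 # some object inside the heap
assert decompress(compress(obj)) == obj  # round trip works for real objects

zeroed_field = 0                         # field of a freshly zeroed object
print(hex(decompress(zeroed_field)))     # prints: 0x7f0000000000
```

&lt;/pre&gt;The zeroed field decodes to the heap base rather than to native 0, so the VM ends up treating both values as NULL.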
&lt;br /&gt;&lt;br /&gt;So I suggested removing this managed_null concept, or unifying managed_null and NULL (0) by using managed_null just for NULL. It's still somewhat useful to distinguish the concepts of managed_null and NULL, although they have the same value. Then NULL is only for native pointer 0.&lt;br /&gt;&lt;br /&gt;With compressed refs, the 0 offset (heap base address) should be dealt with specially, because a ref value 0 can be a real NULL or the compressed heap base. A simple way is to never allocate an object at the heap base, so there is no valid 0 value for a compressed ref. &lt;br /&gt;&lt;br /&gt;Btw, with the compressed ref technique, a heap larger than 4GB can be supported with a single heap base address. The trick is to leverage the fact that the GC always allocates Java objects at aligned addresses. If objects are aligned to 2^n bytes, the n LSBs are always 0, which can be used to encode more address information. For example, if objects are 8-byte aligned (n = 3), we can compute the compressed ref value as:&lt;br /&gt;    CompressedRef = ( NativeAddr - HeapBaseAddr ) &gt;&gt; 3;&lt;br /&gt;    NativeAddr = ( CompressedRef &lt;&lt; 3 ) + HeapBaseAddr;&lt;br /&gt;So a 32-bit compressed ref can represent a 35-bit native address range, which is 32GB.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-4381631953360799879?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/4381631953360799879/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=4381631953360799879' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/4381631953360799879'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/8812827877261189081/posts/default/4381631953360799879'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/10/reconsider-managednull-concept-in.html' title='Reconsider the managed_null concept in Harmony DRLVM'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-7174210522820597087</id><published>2007-10-23T18:04:00.001-07:00</published><updated>2008-01-20T01:09:59.103-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='Apache Harmony'/><title type='text'>What language to choose for Harmony GC development?</title><content type='html'>During Harmony GC development starting from GCv5, I made a design decision that we should try to keep the programming language "C" compatible. Although GCv5 uses certain C++ style, e.g., the source file names are using cpp suffix, we tried to avoid any C++ specific things.&lt;br /&gt;&lt;br /&gt;The reasons I chose C for GCv5 development are:&lt;br /&gt;&lt;br /&gt;1. We want GCv5 to control all its own memory, i.e., there is no hidden memory management brought in by the language. Writing GC in C++ doesn’t cause serious problem in this issue, but the problem is obvious when writing GC in Java, where there are lots of hidden objects allocated.&lt;br /&gt;&lt;br /&gt;2. We try to keep GCv5’s capability for other runtimes written in C, which is common in open source community, such as Linux kernel, gcj, Ruby, etc. We expect one day GCv5 can be applied to some of them, although I don’t know when it will. 
We had successfully ported a version of GC to Ruby 1.9 last year.&lt;br /&gt;&lt;br /&gt;3. We want to make GCv5 self-sufficient with all its own encapsulated utils, so that it can be easily ported to other languages if we want. For example, this design makes it very easy to write GC Runtime Helpers in Java. (I will talk about the runtime helpers later.)&lt;br /&gt;&lt;br /&gt;4. In the early stage of GCv5 development, I used some C++ data structures such as linked list, vector, etc., then I removed them gradually, because we want to have delicate control on synchronous accesses to the data structures, such as sync-queue, sync-list, etc.&lt;br /&gt;&lt;br /&gt;The arguments above are not strong enough though. I have to say it’s also kind of perfectionism. :)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-7174210522820597087?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/7174210522820597087/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=7174210522820597087' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/7174210522820597087'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/7174210522820597087'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/10/what-language-to-choose-for-harmony-gc.html' title='What language to choose for Harmony GC development?'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-4890300892117342083</id><published>2007-10-09T20:45:00.000-07:00</published><updated>2008-01-20T01:09:59.103-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Apache Harmony'/><title type='text'>EIOffice with Harmony</title><content type='html'>It's good news for the Apache Harmony community that a developer testing version (v0.02) of the "EIOffice with Harmony" bundle was released on SourceForge at http://sourceforge.net/projects/eio-harmony/. &lt;br /&gt;&lt;br /&gt;EIOffice is an office suite written in pure Java and based on Java Swing. It has complete functionality for document processing, presentation creation, and spreadsheet generation. EIOffice is developed by Evermore Software Co., a company located in Wuxi city, Jiangsu province, China. EIOffice is the abbreviation of "Evermore Integrated Office", so sometimes it's called EIO as well. It's "integrated" because it is a single application that supports all three types of office document processing (document, presentation and spreadsheet). The data can be "linked" between them, such as from a spreadsheet into a report presentation. Once an update is made at one site, all its linked sites are updated accordingly and automatically.&lt;br /&gt;&lt;br /&gt;The really appealing feature of EIOffice to a software developer is that it's written in pure Java, yet I have seen no obvious performance issue when using it. And the memory footprint is acceptable (or surprisingly lower than expected).&lt;br /&gt;&lt;br /&gt;Making EIOffice work with Harmony is a serious exercise for Harmony's graphics classlib support (swing/awt/Java2D). It's said that EIOffice is the world's largest single (desktop) application written in Java. Once Harmony can run all its functionality smoothly, that probably means Harmony is ready for any Java desktop application. 
The current bundle is version 0.02. It's still a long way to go, but considering the fast progress of Apache Harmony development, I believe a version 1.0 could be expected in a couple of quarters.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-4890300892117342083?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/4890300892117342083/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=4890300892117342083' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/4890300892117342083'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/4890300892117342083'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/10/eioffice-with-harmony.html' title='EIOffice with Harmony'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-6051151266019954521</id><published>2007-08-25T06:04:00.000-07:00</published><updated>2008-01-20T01:08:59.599-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><title type='text'>Issues with Ruby GC</title><content type='html'>Last September I investigated Ruby scripting language and found some issues with its GC design. A new GC should improve in these issues.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;1. 
Fixed object length:&lt;/b&gt;&lt;ul&gt;&lt;br /&gt;Pros:&lt;br /&gt;&lt;li&gt;Simplify algorithm for dynamic objects&lt;br /&gt;&lt;li&gt;No inter-object fragmentation&lt;br /&gt;&lt;br /&gt;Cons:&lt;br /&gt;&lt;li&gt;Inner-object fragmentation&lt;br /&gt;&lt;li&gt;Requires extra C spaces distributed in C heap&lt;br /&gt;&lt;li&gt;Bad object locality&lt;br /&gt;&lt;br /&gt;Improvement idea&lt;br /&gt;&lt;li&gt;Allocate data for an object dynamically, put together in collection&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;2. Mixed Ruby/C Heaps&lt;/b&gt;&lt;ul&gt;&lt;br /&gt;Pros:&lt;br /&gt;&lt;li&gt;Supports dynamic objects&lt;br /&gt;&lt;li&gt;Easy Extensibility with C modules&lt;br /&gt;&lt;br /&gt;Cons:&lt;br /&gt;&lt;li&gt;Bad object locality&lt;br /&gt;&lt;li&gt;High GC overhead, need to trace dead objects for C spaces release (~10% GC time)&lt;br /&gt;&lt;br /&gt;Improvement idea&lt;br /&gt;&lt;li&gt;Use unified heap&lt;br /&gt;&lt;li&gt;C extension still needs considering&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;3. Non-moving Collection&lt;/b&gt;&lt;ul&gt;&lt;br /&gt;Pros:&lt;br /&gt;&lt;li&gt;Support conservative root set enumeration and object scanning&lt;br /&gt;&lt;br /&gt;Cons:&lt;br /&gt;&lt;li&gt;Heap fragmentation&lt;br /&gt;&lt;li&gt;Object access locality&lt;br /&gt;&lt;br /&gt;Improvement idea&lt;br /&gt;&lt;li&gt;Compact the moveable objects&lt;br /&gt;&lt;li&gt;Release the freed pages&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;4. Long Live Objects&lt;/b&gt;&lt;ul&gt;&lt;br /&gt;AST tree holds lots of long live objects&lt;br /&gt;&lt;li&gt;T_NODE&lt;br /&gt;&lt;li&gt;Takes ~90% live objects in a Rails application&lt;br /&gt;&lt;br /&gt;Improvement idea&lt;br /&gt;&lt;li&gt;Introduce generational GC, or&lt;br /&gt;&lt;li&gt;Deal with AST specially&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;5. 
Parallelisms&lt;/b&gt;&lt;ul&gt;&lt;br /&gt;Ruby threading model&lt;br /&gt;&lt;li&gt;Use a large lock for most operations&lt;br /&gt;&lt;li&gt;Yield to switch context&lt;br /&gt;&lt;li&gt;No parallelisms in either allocation and collection&lt;br /&gt;&lt;br /&gt;Improvement idea&lt;br /&gt;&lt;li&gt;Make allocation parallel at first&lt;br /&gt;&lt;li&gt;Ruby needs a better threading model&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;My experiments showed GC takes big portion in total execution time. Hopefully a new GC can reduce GC time substantially.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-6051151266019954521?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/6051151266019954521/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=6051151266019954521' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/6051151266019954521'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/6051151266019954521'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/08/issues-with-ruby-gc.html' title='Issues with Ruby GC'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-3139140306574182198</id><published>2007-08-22T08:24:00.000-07:00</published><updated>2008-01-20T01:08:59.600-08:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><title type='text'>Software prefetching techniques for GC marking phase</title><content type='html'>We experimented with three different software prefetching algorithms in Harmony GC. We used them to improve the marking phase, since it is the most time-consuming and most memory-intensive phase.&lt;ul&gt;&lt;br /&gt;&lt;li&gt;prefetch-on-grey (POG): The collector prefetches the target object when it is pushed onto the mark stack, which gives a prefetch distance equal to the interval between the time an object is pushed and the time it is popped.&lt;br /&gt;&lt;li&gt;buffered-prefetch (BP): The collector maintains an extra prefetch buffer queue. Objects are enqueued to the buffer from the mark stack until the queue is full or the mark stack is empty. The object at the tail of the buffer is prefetched while the object at the head of the buffer is scanned.&lt;br /&gt;&lt;li&gt;prefetch-without-mark (PWM): The collector puts all reference fields onto the mark stack instead of only those pointing to unmarked objects, so as to delay the access to the objects. A referenced object can be prefetched right before it is going to be checked and marked.&lt;/ul&gt;&lt;br /&gt;In our experiments, "prefetch-on-grey" is normally the most efficient technique. "prefetch-without-mark" can sometimes help as well for copying GC. "buffered-prefetch" can hardly bring any benefit due to its high overhead.&lt;br /&gt;&lt;br /&gt;Most interestingly, when we turned on the hardware prefetchers together with software prefetching, we found the effects were better than or equal to the product of the speedups achieved by the two prefetchers when applied separately.&lt;br /&gt;Copying GC does not get this double benefit as the other collectors do, possibly due to the high bandwidth requirement of the "prefetch-without-mark" algorithm. 
But the hybrid prefetching is still better than any single one anyway.&lt;br /&gt;&lt;br /&gt;We reported the results in paper "Behavior Characterization and Performance Study on Compacting Garbage Collectors with Apache Harmony"[1].&lt;br /&gt;&lt;br /&gt;[1] http://people.apache.org/~xli/docs/caecw07-compacting-GCs.pdf&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-3139140306574182198?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/3139140306574182198/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=3139140306574182198' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/3139140306574182198'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/3139140306574182198'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/08/software-prefetching-techniques-for-gc.html' title='Software prefetching techniques for GC marking phase'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-1154012616533089424</id><published>2007-06-30T02:31:00.000-07:00</published><updated>2007-06-30T03:49:41.604-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='Apache Harmony'/><title type='text'>Improve LOS performance in 
Harmony GCv5</title><content type='html'>Without changing the fundamental design of LOS, we can still improve overall GC throughput by finding the mismatches between the current LOS design and the characteristics of real workloads with large objects.&lt;br /&gt;&lt;br /&gt;In the current GCv5 LOS implementation, we have the following design decisions:&lt;ol&gt;&lt;br /&gt;&lt;li&gt;LOS is only collected in major collections.&lt;br /&gt;&lt;li&gt;LOS full triggers a major collection, and LOS will be collected by compaction.&lt;br /&gt;&lt;li&gt;If a major collection is not triggered by LOS full, LOS will be collected with mark-sweep.&lt;/ol&gt;&lt;br /&gt;The rationales discussed in Adaptive LOS Size Adjustment at Runtime [1] are mostly general folk beliefs that do not have strong data support. In our experience with many Java workloads, we found that this common wisdom may not hold, and some of it is even wrong. Here I'd like to point out some facts while revisiting the above design decisions with new comments.&lt;ol&gt;&lt;br /&gt;&lt;li&gt;LOS is only collected in major collections. &lt;br /&gt;(&lt;b&gt;Note:&lt;/b&gt; This design decision is quite questionable. Mark-sweep is not expensive itself. More importantly, most of the mark-sweep time is spent in its marking phase, say, about 95% of the total. In non-generational mode of GCv5, the marking phase traces LOS anyway, so why not piggyback a sweeping phase on it?)&lt;br /&gt;&lt;li&gt;LOS full triggers a major collection, and LOS will be collected by compaction.&lt;br /&gt;(&lt;b&gt;Note:&lt;/b&gt; This is not necessary. LOS full can trigger a minor collection, and then LOS can be collected with mark-sweep. The cost of a minor collection is normally much smaller than that of a major one, and mark-sweep of LOS might be enough to satisfy the LOS allocation request. An improvement is for the GC to always do a minor collection on LOS full. If that cannot satisfy the failed LOS allocation request, the GC conducts a major collection with compaction. 
Compared to the original design, the improved design in the worst case has an extra minor collection plus LOS mark-sweep before the expected major compaction. Since a minor collection is much cheaper than a major one, it may not really bring much overhead. On the other hand, if the minor collection can satisfy the LOS allocation request, the overall GC throughput will increase.)&lt;br /&gt;&lt;li&gt;If a major collection is not triggered by LOS full, LOS will be collected with mark-sweep. &lt;br /&gt;(&lt;b&gt;Note:&lt;/b&gt; A major collection itself is far more expensive than LOS mark-sweep. I think we can even compact LOS in major collections with only a marginal overhead increase. Mark-sweep does not defragment LOS, so it misses a good chance to improve space quality. The real point is that, although we want mark-sweep on LOS to avoid moving large objects, we already get that benefit from the frequent minor collections, where only the NOS objects are moved.)&lt;br /&gt;&lt;/ol&gt;After thinking over the original design philosophy and our intended effects, we believe we made improper design decisions. We would make the following new decisions:&lt;ol&gt;&lt;br /&gt;&lt;li&gt;LOS is collected in both minor and major collections. &lt;br /&gt;&lt;li&gt;LOS full triggers a minor collection. If the minor collection does not satisfy the LOS allocation request, a major collection is activated as a fallback. &lt;br /&gt;&lt;li&gt;LOS is collected with mark-sweep in minor collections and with compaction in major collections.&lt;/ol&gt;&lt;br /&gt;The idea of the new design is pretty clear: the benefit of LOS is achieved not by leaving it uncollected, but by mark-sweeping it. This is the lesson we learned.&lt;br /&gt;One caveat: LOS mark-sweep in a minor collection assumes that it only piggybacks the sweeping phase, because a non-gen mode minor collection traces the entire heap. But this is not true for generational mode, where only NOS is traced. 
So for generational mode, we will still not collect LOS as before. It should be studied if this is a serious issue. One possible solution is to have NOS for large objects as well.&lt;br /&gt;&lt;br /&gt;[1]http://xiao-feng.blogspot.com/2007/06/adaptive-los-size-adjustment-at-runtime.html&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-1154012616533089424?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/1154012616533089424/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=1154012616533089424' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/1154012616533089424'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/1154012616533089424'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/06/improve-los-performance-in-harmony-gcv5.html' title='Improve LOS performance in Harmony GCv5'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-3198787404102267138</id><published>2007-06-30T00:32:00.000-07:00</published><updated>2007-06-30T03:49:41.604-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='Apache Harmony'/><title type='text'>Re-examine the LOS design in Harmony GCv5</title><content type='html'>In 
the last few blog entries, I introduced some LOS design considerations in Apache Harmony GCv5. We found that having a separate LOS might cause big trouble for a high-performance GC. The major issue is the LOS size adjustment for best heap utilization. &lt;ol&gt;&lt;br /&gt;&lt;li&gt;LOS adjustment always requires a major collection so as to re-partition the heap between LOS and non-LOS. When the LOS size varies significantly during application execution, the frequently triggered major collections degrade overall performance.&lt;br /&gt;&lt;li&gt;When LOS extends, the non-LOS size is reduced. We need to estimate the maximal reduction of non-LOS that still accommodates its live objects. The estimate has to be conservative, because running out of memory during a collection is fatal. But the conservative estimate may leave non-LOS not fully utilized. We can do trial non-LOS shrinks to secure the reduction, but that causes overhead and complexity. (LOS shrinking does not have the over-reduction issue, because LOS is managed with free lists, so it does not have the fragmentation issues of the block-based non-LOS space.)&lt;br /&gt;&lt;li&gt;Our design decision that LOS is only collected in major collections looks too strong. For some LOS-intensive applications, although lots of large objects have short life spans, GCv5 does not collect them promptly, so LOS gets full quickly and triggers major collections frequently.&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;For (1) above, I am afraid there is no good solution with the current design, because the LOS size adjustment is sometimes unavoidable due to workload characteristics. We have some other design choices to solve it:&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Do not use a separate LOS; instead, allocate large objects in a large block (i.e., multiple  adjacent blocks). It may not be hard for GCv5 Mspace to adopt this design.  &lt;br /&gt;&lt;li&gt;Still have a separate LOS, but let LOS and non-LOS grow toward each other. The faster-growing space will naturally take the space in between. 
This design has a problem for generational collection, where NOS stays at one end of the heap. In this design, NOS will have to stay in between LOS and MOS. I guess we can overcome this issue by dynamically changing the space layout in the heap, e.g., putting the NOS at one end for generational mode, and in between for non-gen mode.&lt;br /&gt;&lt;li&gt;Yet another idea is not to give LOS a pre-allocated space, but to use the mmap service to get space on demand. This design may not really simplify the issue.&lt;/ul&gt;&lt;br /&gt;The under-utilization problem of (2) above is hard to solve if we do not change the LOS design. Trial non-LOS reduction is not desirable. Besides the overhead and complexity issues, it may not ultimately solve the under-utilization problem, since its first successful trial might already use a too conservative size estimate.&lt;br /&gt;&lt;br /&gt;Problem (3) above can be improved without changing the fundamental design. The issue is mainly caused by the design decision that LOS is only collected in major collections. (Note, this is not only Harmony GCv5's own choice. As far as I know, some other GCs make the same design choice.) 
I will discuss this topic in next blog entry, since it is only incremental improvement over existing design.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-3198787404102267138?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/3198787404102267138/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=3198787404102267138' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/3198787404102267138'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/3198787404102267138'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/06/re-examine-los-design-in-harmony-gcv5.html' title='Re-examine the LOS design in Harmony GCv5'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-5905684750217380076</id><published>2007-06-29T06:58:00.000-07:00</published><updated>2007-06-30T03:04:43.780-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><title type='text'>Some ideas for LOS design</title><content type='html'>Due to the complexity of LOS design for high performance, we probably want to give up the idea of LOS at all, i.e., not to have a separate space of large objects. On the other hand, I have some other ideas for LOS design.&lt;br /&gt;&lt;br /&gt;1. 
&lt;b&gt;Use OS VM service for large object movement&lt;/b&gt;. Since the fundamental reason for LOS is to eliminate the overhead of copying large objects, we can do the copying without real memory movement. The idea is to remap the large objects to the target address with a virtual memory service. Current OSes do not provide this service yet, but it is actually trivial to modify the OS kernel to support virtual address remapping. Then we can collect large objects with a moving collector without the memory copying overhead (but with memory remapping overhead).&lt;br /&gt;&lt;br /&gt;2. &lt;b&gt;Partition large object into small segments&lt;/b&gt;. If a GC doesn't move large objects, it can lead to fragmentation. We can partition a large object into fixed-size small segments, and the segments do not need to stay together. With this arrangement, the fragmentation can be eliminated, thus there is no need to compact LOS. The segments of a large object can simply be treated as a subtree of objects. This idea requires JIT support to translate a large object field access into a segment access.&lt;br /&gt;&lt;br /&gt;There is a paper specifically on LOS design [1]. But it does not discuss space size adjustment or the LOS fragmentation issue. &lt;br /&gt;&lt;br /&gt;Some GC designs do not pre-allocate a space for LOS, but allocate large objects on demand. In this way, LOS is never full. But I do not think this is a serious design for a product GC. &lt;br /&gt;&lt;br /&gt;Some early work with LOS distinguished the large objects that have no references and put them into a separate LOS so that they don't need to be scanned. While this design might benefit some special GC algorithms, I do not see much value in it for a JVM.&lt;br /&gt;&lt;br /&gt;[1] Michael Hicks, Luke Hornof, Jonathan T. Moore, Scott M. 
Nettles, A Study of Large Object Spaces, ISMM1998.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-5905684750217380076?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/5905684750217380076/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=5905684750217380076' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/5905684750217380076'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/5905684750217380076'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/06/some-ideas-for-los-design.html' title='Some ideas for LOS design'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-5200262044947503785</id><published>2007-06-28T07:31:00.000-07:00</published><updated>2007-06-30T00:36:23.140-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='Apache Harmony'/><title type='text'>Design issues in LOS extension: allocation speed estimation</title><content type='html'>As I discussed in last blog article [1], the idea for LOS extension heuristic is intuitive, but some caveats in implementation should be fully understood. I have covered the issue with non-LOS space size estimation. 
In this essay, I will discuss how to compute the allocation speed [2].&lt;br /&gt;&lt;br /&gt;The allocation speed of a space is easy to compute if the space is flat: it is just the bytes allocated between two collections. But this is not the case for the non-LOS space of Harmony GCv5, which consists of NOS (nursery object space) and MOS (mature object space). Between two major collections, many minor collections can happen. In GCv5, a minor collection copies NOS surviving objects to MOS, so MOS becomes more and more filled until its free space can no longer hold the NOS survivors, at which point a major collection should be triggered to collect MOS (actually the whole heap). (Note, the major collection is triggered with a more intelligent heuristic in GCv5. The mechanism I describe here is much simplified.) &lt;br /&gt;&lt;br /&gt;When a major collection is triggered, the non-LOS space has been partially collected many times. Then the question is, what is the allocation speed of non-LOS space?&lt;br /&gt;&lt;br /&gt;In GCv5, we use the maximal MOS size to approximate the allocation speed of a notional flat non-LOS. This is reasonable. Between two major collections, we can think of the notional flat non-LOS as getting filled gradually and triggering a collection when it reaches the maximal MOS size. The difference between the MOS size right before a major collection and right after the previous one can be regarded as the total allocated bytes of the notional non-LOS space. &lt;br /&gt;&lt;br /&gt;We compute the allocation speed of LOS as the difference in its summed object size between right after one collection and right before the next. Then we partition the heap into LOS and non-LOS by giving them free space sizes proportional to their respective allocation speeds. But the free space given to non-LOS is actually given to MOS, so that we expect MOS and LOS to get full at the same time. The overall non-LOS size is the sum of MOS and NOS. 
(The NOS size is the same as that of before the major collection. But since NOS and MOS will partition the non-LOS size between them, this NOS size does not make much sense. We only need know how much free space is given to LOS, then the overall non-LOS size is the total heap size subtracted by the new LOS size.)&lt;br /&gt;&lt;br /&gt;[1]http://xiao-feng.blogspot.com/2007/06/design-issues-in-los-extension-non-los.html&lt;br /&gt;[2]http://xiao-feng.blogspot.com/2007/06/adaptive-los-size-adjustment-at-runtime.html&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-5200262044947503785?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/5200262044947503785/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=5200262044947503785' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/5200262044947503785'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/5200262044947503785'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/06/design-issues-in-los-extension.html' title='Design issues in LOS extension: allocation speed estimation'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-1033311465416760217</id><published>2007-06-28T01:56:00.000-07:00</published><updated>2007-06-30T00:36:23.141-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='Apache Harmony'/><title type='text'>Design issues in LOS extension: non-LOS size estimation</title><content type='html'>In the last blog entry [1], I discussed the heuristic for adaptive LOS size adjustment at runtime. In a real implementation of the heuristic, there are two important issues to consider. In this article, I will discuss the first issue, non-LOS space size estimation.&lt;br /&gt;&lt;br /&gt;When we extend the LOS size in a major collection, we shrink the non-LOS size. We must ensure that the reduced non-LOS size is large enough to hold the live objects in it. But how? This is not obvious. We can compute the total size of all the live objects in it during heap tracing, but this size is only a minimum and may not be enough, for two reasons:&lt;ul&gt; &lt;br /&gt;&lt;li&gt;Non-LOS heap is arranged in blocks. Each block can waste a little space at its end when the remaining free area is too small to hold the next live object. That object has to be put in the next block, leaving some space wasted in this block. The accumulated waste can be big, depending on the block size, the large object threshold size, and the live object size distribution. &lt;br /&gt;&lt;li&gt;Non-LOS heap is collected by multiple collector threads. The collectors grab blocks one by one for compaction. A block is the basic compaction unit, so one collector cannot move its objects to another collector’s block. Due to the parallel load balance design, the last few compacted blocks might be only partially filled. 
This effect also leads to space waste.&lt;/ul&gt;&lt;br /&gt;After many experiments, we currently use a statistical approach for the non-LOS size estimation. Basically, the GC adds some additional space on top of the total size of the non-LOS live objects. The additional size is composed of two parts, to compensate for the space wasted by the two reasons above respectively. &lt;br /&gt;&lt;br /&gt;For the block fragmentation issue, the GC first collects the live object size distribution during heap tracing, and then computes the ratio of each size range in the total non-LOS live size. This ratio is used as the probability of a live object appearing at the end of a block. For example, if we partition the sizes at 64-byte granularity, and the large object size threshold is 4KB, then there are 4K/64 = 64 size ranges, range &lt;b&gt;Ri&lt;/b&gt; for objects with i*64 &lt; size &lt;= (i+1)*64. We use &lt;b&gt;Si&lt;/b&gt; for the upper value of the size range, (i+1)*64. &lt;b&gt;Pi&lt;/b&gt; is the probability of an object in range Ri appearing at the end of a block. We need to leave each block extra space of size = SUM (&lt;b&gt;Pi * Si&lt;/b&gt; ). Then we multiply this value by the number of live blocks to get the total reserved space due to block fragmentation.&lt;br /&gt;&lt;br /&gt;For the parallel collector waste issue, the GC simply reserves two blocks for each collector. &lt;br /&gt;&lt;br /&gt;The GC adds the two reserved space sizes to the total size of the non-LOS live objects to get the minimum requirement for the non-LOS space size. That is, LOS extension cannot reduce the non-LOS space size below this number. If this condition cannot be satisfied, the GC either cuts the LOS extension budget, or simply returns without LOS extension. If the LOS extension is needed for a failed very large object allocation, no extension will lead to heap OutOfMemory. 
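&lt;br /&gt;&lt;br /&gt;As a rough illustration of the estimation described above, here is a minimal Python sketch. It is not the GCv5 code: the function name, its parameters, and the choice to weight each size range by its share of live bytes are assumptions made for this example, and the number of live blocks is simply passed in.&lt;pre&gt;
```python
def min_non_los_size(live_sizes, live_blocks, block_size, num_collectors,
                     granularity=64, threshold=4096):
    """Estimate the minimum non-LOS size in bytes (illustrative sketch only)."""
    live_total = sum(live_sizes)
    num_ranges = threshold // granularity  # e.g. 4K/64 = 64 ranges

    # Live bytes per range Ri, which holds sizes in (i*g, (i+1)*g].
    range_bytes = [0] * num_ranges
    for size in live_sizes:
        range_bytes[(size - 1) // granularity] += size

    # Per-block reserve = SUM(Pi * Si), where Pi is the share of range Ri
    # in the total live size and Si = (i+1)*g is the upper bound of Ri.
    per_block = 0.0
    if live_total:
        per_block = sum(range_bytes[i] / live_total * (i + 1) * granularity
                        for i in range(num_ranges))

    fragmentation_reserve = per_block * live_blocks
    collector_reserve = 2 * block_size * num_collectors  # two blocks each
    return live_total + fragmentation_reserve + collector_reserve


print(min_non_los_size([64, 64, 128], live_blocks=2,
                       block_size=32 * 1024, num_collectors=1))
```
&lt;/pre&gt;With three live objects of 64, 64 and 128 bytes in two live 32KB blocks and one collector, the sketch reserves 96 bytes per live block for fragmentation plus two whole blocks for the collector.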
&lt;br /&gt;&lt;br /&gt;The other important issue in LOS extension adaptation is how to compute the allocation speed of non-LOS space, which I will discuss in next entry.&lt;br /&gt;&lt;br /&gt;[1]http://xiao-feng.blogspot.com/2007/06/adaptive-los-size-adjustment-at-runtime.html&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-1033311465416760217?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/1033311465416760217/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=1033311465416760217' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/1033311465416760217'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/1033311465416760217'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/06/design-issues-in-los-extension-non-los.html' title='Design issues in LOS extension: non-LOS size estimation'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-4632486271632912076</id><published>2007-06-28T00:01:00.000-07:00</published><updated>2007-06-30T00:36:23.141-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='Apache Harmony'/><title type='text'>Adaptive LOS size adjustment at runtime</title><content 
type='html'>To adjust the LOS size dynamically, Harmony GCv5 has developed a theory for the heuristic. &lt;br /&gt;&lt;br /&gt;In the initial GCv5 design (it is currently changing), I made a design decision that a minor collection collects only NOS (nursery object space, for non-large object allocation), and a major collection collects the whole heap; a full NOS triggers only a minor collection, and a full LOS triggers a major collection. The rationales are easy to understand:&lt;ul&gt;&lt;br /&gt;&lt;li&gt;In common Java applications, non-large objects dominate the heap. It is expected that NOS will get full much more frequently than LOS. In most cases when NOS is full, LOS will not be full, so there is no need to collect LOS in minor collections. It was believed OK to collect LOS only in major collections.&lt;br /&gt;&lt;li&gt;We expect minor collections to happen frequently and finish quickly every time, so minor collection in GCv5 is fully parallelized for NOS collection. We don’t want LOS collection to be involved as a separate phase and hence hurt minor collection’s throughput.&lt;br /&gt;&lt;li&gt;LOS collection requires tracing the whole heap for live objects, and that overhead is already so high that it ought to buy more than an LOS collection alone. More importantly, LOS collection might not be able to satisfy the failed large object allocation, which means an LOS extension that must be done together with a non-LOS space shrink. The shrink requires a collection to free up the space.&lt;br /&gt;&lt;li&gt;In generational collection, pointers from non-NOS to NOS are recorded in the remembered set. It is straightforward to collect only NOS in minor collections. Collecting LOS in minor collections runs counter to the generational concept, unless LOS is treated as young generation as well.&lt;/ul&gt;&lt;br /&gt;With the design decision in mind, I came up with a very simple and intuitive idea for the space size adjustment heuristic. 
That is, every time a major collection happens, LOS and non-LOS should be almost equally full. Then we developed a theory based on this idea. We introduced a concept called allocation speed, which is the allocated bytes per unit time, and can be measured as the total allocated bytes between two collections. We adjust the LOS and non-LOS heap partition to ensure that their sizes are proportional to their respective allocation speeds. &lt;br /&gt;&lt;br /&gt;This heuristic works well as long as the allocation speed is rather stable across consecutive major collections. But we found it has some problems, so we changed the design recently, while the idea of the heuristic still holds. The new design will be discussed later.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-4632486271632912076?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/4632486271632912076/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=4632486271632912076' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/4632486271632912076'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/4632486271632912076'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/06/adaptive-los-size-adjustment-at-runtime.html' title='Adaptive LOS size adjustment at runtime'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-7616160042380213824</id><published>2007-06-27T20:50:00.000-07:00</published><updated>2007-06-30T00:35:51.536-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><title type='text'>Deal with large objects in GC: LOS or no LOS?</title><content type='html'>Some applications in DaCapo create lots of large objects, which requires GC to handle them efficiently. While there is no strict size threshold for large objects, I usually treat those larger than a couple of KBs as large objects. Depending on the context, the size threshold can be 1KB, 4KB, or 64KB, or even more. &lt;br /&gt;&lt;br /&gt;An interesting question is whether the GC should manage large objects separately or not. I found this is not as easy a question as commonly believed. Anyone who reads much of the GC literature can quickly find that it's a folk belief that a separate large object space (LOS) is the way to go. But in my experience with Harmony GC development, I found it is not so obvious that a GC should have a LOS. &lt;br /&gt;&lt;br /&gt;Normally, a GC has a separate LOS for two reasons:&lt;ul&gt;&lt;br /&gt;&lt;li&gt;&lt;b&gt;GC does not want to move the large objects during collections&lt;/b&gt;.  Large objects are usually managed with a non-moving algorithm to avoid the large memory-copying overhead. Since non-large objects are usually movable during collections, it is reasonable to separate the management of large and non-large objects. For example, LOS can be collected with a mark-sweep algorithm.&lt;br /&gt;&lt;li&gt;&lt;b&gt;The heap spaces are managed in units&lt;/b&gt;. It’s quite normal for the GC to arrange the heap into blocks or chunks, which may not be able to hold an arbitrarily-sized large object. 
There are two solutions for this: a) use multiple contiguous blocks to form a large block that can accommodate the large object; b) do not allocate large objects in the block or chunk space. Yet another solution is c) do not use a block layout for the heap in the first place. &lt;/ul&gt;&lt;br /&gt;Harmony GCv4 uses solution a), forming large blocks for large objects; Harmony GCv4.1 adopts solution c), which has no block or chunk concept at all. Neither a) nor c) has a separate LOS, but both can support pinned objects so as to avoid moving the large objects. Pinned objects fragment the space and disturb the otherwise smooth moving collection algorithm with extra overhead. I personally hate to have pinned objects dotted around a movable space. Since Harmony GCv5 arranges its common heap in blocks, it was natural for me to choose a separate LOS in the design. &lt;br /&gt; &lt;br /&gt;At the beginning the LOS in GCv5 worked very well, but it soon became a troublemaker when GCv5 met various stress tests and performance benchmarks. One of the major issues we met is that a separate LOS needs the flexibility to grow or shrink in size; otherwise, performance suffers badly. While it is not hard to implement a LOS with adaptive size, the hard part is finding a high-performance heuristic for the runtime adaptation, and such a heuristic is not obvious. 
I will discuss the Harmony GCv5 LOS design further in following blog articles.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-7616160042380213824?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/7616160042380213824/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=7616160042380213824' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/7616160042380213824'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/7616160042380213824'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/06/deal-with-large-objects-in-gc-los-or-no.html' title='Deal with large objects in GC: LOS or no LOS?'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-7804414153916058424</id><published>2007-06-14T00:18:00.000-07:00</published><updated>2007-06-16T22:41:48.608-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Programming'/><category scheme='http://www.blogger.com/atom/ns#' term='Multi-core'/><title type='text'>Gap between programming model and processor design</title><content type='html'>While everyone sees programming moving to a higher level with scripting languages, processor design with multiple cores wants to pull programming back to a lower level. 
What's the problem?&lt;br /&gt;&lt;br /&gt;I have spent a lot of time on parallel computation. Back in the early 90's when I was in school, I learned and programmed on the INMOS Transputer with a language called Occam. The programs I wrote were only for research purposes, such as TSP, LU-decomposition, etc. At that time, I never thought about whether this kind of parallel programming model, especially the CSP model, would be useful for common programmers. &lt;br /&gt;&lt;br /&gt;The common programming model (a mix of paradigms and languages) evolved, in my observation, from structural to OO, then to component-based for a while, then to Java, then to scripting. (Yes, scripting can be structural or OO or whatever; I don't want to argue about the concepts of paradigm and language.) It's clear to me that the programming task is becoming easier and easier, in the sense that programmers care less and less about the systems their programs run on. &lt;br /&gt;&lt;br /&gt;Throughout this evolution, we have almost always written sequential programs, whatever the programming model. It is not because the languages do not support parallel constructs. On the contrary, C has pthreads and OpenMP, Java has java.lang.Thread and monitors, Cilk has tasks, and there are also HPF, MPI, and more recently Chapel, X10, and Fortress, etc. But as far as I know, only a few people or programs really use them. Some people argue that this is because we human beings only think serially, while other groups hold the opposite opinion. For example, one founder of INMOS said recently [1] that "Parallelism isn’t difficult, it is just that people, particularly computer scientists, by training and by inclination, are prejudiced against it. Programmers are trained to think serial and to work within the limitations of sequential processing. 
Electronics engineers, by contrast, should be more open, since when they design a chip or an application, they are designing an inherently parallel system, so they are trained to think parallel."&lt;br /&gt;&lt;br /&gt;Personally I do not agree with Iann Barron, although I wish he were right. I learned a little bit about electronics, and I found asynchronous design much harder than its synchronous counterpart. &lt;br /&gt;&lt;br /&gt;On the other hand, microprocessor design is going multi-core, and there is no magic to turn a sequential program into a parallel one (except for certain special cases). To utilize a multi-core processor, programs have to be written to express their parallelism. This requires programmers to consider low-level processor properties such as the number of cores, core topology, inter-core communication, etc. The gap between processor design and programming model is getting bigger. Do we have any good solutions?&lt;br /&gt;&lt;br /&gt;This gap reminds me of a very similar situation in microprocessor design, where the gap between processor speed and memory speed keeps growing. People have known about this trend for years, but the only realistic solution has been to increase on-chip cache, in both size and number of levels. In today's microprocessors, cache consumes a large share of the die area. Some people argue that this is because the Von Neumann model, which is essentially a load-store machine, is bad. They believe a data-driven model is more intuitive for both the executed tasks and the processor design, and hence can beat the Von Neumann model. But so far none of the data-driven processors has had reasonable success, and the widening gap between processor and memory speed is still there. &lt;br /&gt;&lt;br /&gt;It's sometimes stupid to blame the common users/programmers/consumers when you fail to push some "cool" idea to them. I really can't accept Iann Barron's claim of "they are all wrong!". 
Maybe in the end (in 5-10 years, for example) the solution will be some data-driven/data-flow processor with a CSP/data-flow programming model; I don't think that would prove we are wrong today. Maybe somebody will be able to say then that "I invented it a decade ago"; that does not mean much to me. One simple reason is that one has to do the right thing at the right time.&lt;br /&gt;&lt;br /&gt;[1]&lt;a href="http://www.embeddedtechjournal.com/articles_2007/20070612_parallel.htm"&gt;Parallel Processing Considered Not Harmful.&lt;/a&gt; 2007.6.12.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-7804414153916058424?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/7804414153916058424/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=7804414153916058424' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/7804414153916058424'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/7804414153916058424'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/06/gap-between-programming-trend-and.html' title='Gap between programming model and processor design'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-6697611032926125161</id><published>2007-05-24T00:11:00.000-07:00</published><updated>2007-06-16T22:40:10.275-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='JVM'/><category scheme='http://www.blogger.com/atom/ns#' term='Apache Harmony'/><title type='text'>Weak reference processing in Apache Harmony</title><content type='html'>To continue the last blog entry [1], I will briefly introduce weak reference processing in Apache Harmony. There is good material on this topic on a doxygen-generated page [2]. My description here is very similar to that page, since the design is almost the same in both Harmony GCs (GCv4.1 and GCv5). To some extent, my description can be viewed as a rephrasing of that page, although it reflects my own understanding and is based on GCv5's design. &lt;ol&gt;&lt;br /&gt;&lt;li&gt;During heap tracing, mark all the reachable objects except referents. At the same time, build three lists to record the marked reference objects, one list per reference type.&lt;br /&gt;&lt;li&gt;Process Soft Reference objects specially during heap tracing in a minor collection. In a minor collection, SoftRef objects are treated as normal objects, so their referents are marked as well; thus there are actually only two lists of marked reference objects. &lt;br /&gt;&lt;li&gt;Process Soft Reference objects after heap tracing (only in a major collection). The live SoftRef object list is traversed, and every referent field is checked to see whether the referent object is unmarked (dead). 
For a dead referent, the referent field in the SoftRef object is cleared; otherwise, the SoftRef object with a live referent is removed from the list.&lt;br /&gt;&lt;li&gt;Process Weak Reference objects after SoftRef processing, in the same way as described above. The reason for this order (process WeakRef after SoftRef) is that some objects might be weakly reachable through a path that contains more than one Reference object.&lt;br /&gt;&lt;li&gt;Process finalizable objects. Traverse the Finalizer object list. (Objects were added to the list when they were created.) Trace the heap from the dead objects in the list to resurrect all objects reachable from them. These include Reference objects. In the current GCv5 design, those resurrected Reference objects are not put into the list for their type. (There is no specification on this part. Maybe it's better to put them into the list as well, since they are live anyway.) The finalizable objects (dead but now resurrected) are removed from the Finalizer object list and put into a Finalizable object list.&lt;br /&gt;&lt;li&gt;Process Phantom Reference objects at this point, in the same way as the other Reference processing. PhanRef processing is ordered after finalization because it must treat the resurrected objects as live ones. This is a good feature, since it gives Java a broader view of live objects in the whole system, including those only accessed in finalizers.&lt;br /&gt;&lt;li&gt; All the remaining items in the lists of live reference objects are passed to the VM for further handling. &lt;br /&gt;&lt;li&gt; After the mutators are resumed, the enqueue() methods of the reference objects are called. This is done by some dedicated thread(s), which can be regarded as mutator thread(s).&lt;br /&gt;&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;In the current Apache Harmony implementation, the thread pool is managed with WeakReference, so that finished thread entities can be reused after they are enqueued. 
This has a subtle issue: since the specification places no requirement on the timing of WeakReference enqueuing operations (the last step above), we cannot guarantee that the finished threads are enqueued in a timely manner. This can cause a memory leak.&lt;br /&gt;&lt;br /&gt;[1]http://xiao-feng.blogspot.com/2007/05/weak-reference-in-jvm.html&lt;br /&gt;[2]&lt;a href="http://harmony.apache.org/subcomponents/drlvm/doxygen/gc_gen/html/gc_finalization_and_weak_refs.html"&gt;Finalization and weak references design in GC&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-6697611032926125161?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/6697611032926125161/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=6697611032926125161' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/6697611032926125161'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/6697611032926125161'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/05/weak-reference-processing-in-apache.html' title='Weak reference processing in Apache Harmony'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-2689730062090249192</id><published>2007-05-23T08:06:00.000-07:00</published><updated>2007-06-16T22:05:22.340-07:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='JVM'/><title type='text'>Weak reference in JVM</title><content type='html'>Weak reference is a general name for three different reference types in Java: soft reference, weak reference, and phantom reference. Java introduces weak references to give the programmer an explicit way to manage objects' lifetimes (in my understanding). A weak reference is not an ordinary Java object reference; it is a Java object of a reference type. &lt;br /&gt;&lt;br /&gt;Since objects' lifetimes in Java are managed automatically by the garbage collector, it is inconvenient or impossible for a programmer to know whether an object is dead: if the object is referenced by the application, it is live; when it is dead, there is no reference in the application to the object. That is, if you can still see the object, it must be live; once it is dead, you have no way to see it. But sometimes a programmer wants to know whether an object is dead, or wants to see the object even after it is dead. The most commonly given example is a cache. My browser keeps a cache of the pages I have visited, so that the next time I visit a page, its contents can be loaded directly from the cache if they have not expired. The cache contents are virtually dead in the sense that they can be cleaned without any problem. But we can still view them, and even resurrect them when needed. If the cache mechanism is written in Java, a weak reference is a handy tool.&lt;br /&gt;&lt;br /&gt;A weak reference is in some sense a pointer to an object, but the pointer itself is represented by a Java object. This reference object has a field holding a reference to the target object, which is called the &lt;i&gt;referent&lt;/i&gt;. If an object can only be accessed through this weak reference object, the object is actually dead and subject to the GC's arbitrary disposal. 
In this case, the object is called "weakly reachable". The application can still reach the object before the GC reclaims it. Once the object is reclaimed, the weak reference object has finished its duty, and the application can decide whether to reuse it to manage another object or to forget it. When the referent is reclaimed by the GC, the referent field in the weak reference object is set to null, which is called "cleared", and the reference object is put into a queue if one is registered. The programmer can use the queue to manage reference objects as a group.  &lt;br /&gt;&lt;br /&gt;The implementation of weak references is straightforward. A weak reference object cannot be treated as a common object during object scanning, because it actually only represents a pointer. So its reference to the referent is not considered a real reference (but a weakly reachable link), and the GC will not mark its referent as a live object through it. Only if the referent is reached through a non-weakly reachable path can it be marked live. So the steps for weak reference processing are something like the following:&lt;ol&gt;&lt;br /&gt;&lt;li&gt; During heap tracing, mark all reachable objects except referents (weakly reachable objects). At the same time, record all the weak reference objects in a list.&lt;br /&gt;&lt;li&gt; After heap tracing, go through the reference object list built in the last step and check whether any reference object's referent is dead. If a referent is dead, the referent field in the reference object is set to null (cleared).&lt;br /&gt;&lt;li&gt; All the cleared reference objects are passed to the VM, to be enqueued by executing their enqueue() methods.&lt;br /&gt;&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;The processing steps here are simplified since the three weak reference types (soft, weak, phantom) are not distinguished. 
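&lt;br /&gt;&lt;br /&gt;The three steps above can be sketched as a toy Java program. This is only an illustration under stated assumptions, not Harmony code: the classes Obj and WeakRef, the discovered list, and the enqueued flag are hypothetical stand-ins for the collector's mark bit, its reference object list, and the VM's call to enqueue().&lt;br /&gt;
&lt;pre&gt;
```java
public class WeakRefSketch {
    // Toy heap object (hypothetical): strong fields plus a mark bit.
    static class Obj {
        Obj[] fields = new Obj[0];   // strong references, followed by the tracer
        boolean marked;
    }

    // Toy reference object (hypothetical): the referent is a weak link.
    static class WeakRef extends Obj {
        Obj referent;                // deliberately NOT followed by the tracer
        WeakRef next;                // links the list of discovered reference objects
        boolean enqueued;            // stands in for the VM calling enqueue()
    }

    static WeakRef discovered;       // head of the list built during tracing

    // Step 1: mark everything strongly reachable and record reference objects.
    static void trace(Obj o) {
        if (o == null || o.marked) return;
        o.marked = true;
        if (o instanceof WeakRef) {
            WeakRef r = (WeakRef) o;
            r.next = discovered;
            discovered = r;
        }
        for (Obj f : o.fields) trace(f);
    }

    // Steps 2 and 3: clear references with dead referents and hand them over.
    static int processReferences() {
        int cleared = 0;
        for (WeakRef r = discovered; r != null; r = r.next) {
            if (r.referent == null || r.referent.marked) continue; // referent is live
            r.referent = null;       // "cleared"
            r.enqueued = true;       // would be passed to the VM for enqueuing
            cleared++;
        }
        discovered = null;
        return cleared;
    }

    public static void main(String[] args) {
        Obj root = new Obj();
        Obj shared = new Obj();      // also strongly reachable from root
        Obj onlyWeak = new Obj();    // reachable only through a weak reference
        WeakRef r1 = new WeakRef();
        r1.referent = shared;
        WeakRef r2 = new WeakRef();
        r2.referent = onlyWeak;
        root.fields = new Obj[] { shared, r1, r2 };

        trace(root);
        int cleared = processReferences();
        // prints "cleared=1 r1.live=true r2.enqueued=true"
        System.out.println("cleared=" + cleared
                + " r1.live=" + (r1.referent != null)
                + " r2.enqueued=" + r2.enqueued);
    }
}
```
&lt;/pre&gt;
Linking the discovered reference objects through a field of their own avoids allocating a separate list while the collector is running, which is one plausible design choice among several. The sketch, like the steps above, treats soft, weak, and phantom references alike. 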
I will discuss them in the next blog entry.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-2689730062090249192?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/2689730062090249192/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=2689730062090249192' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/2689730062090249192'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/2689730062090249192'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/05/weak-reference-in-jvm.html' title='Weak reference in JVM'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-3181646928112936694</id><published>2007-05-22T03:13:00.000-07:00</published><updated>2007-05-22T04:48:40.309-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='JVM'/><title type='text'>How to create &gt;1GB heap size in 32-bit Windows?</title><content type='html'>I want to keep a record of this here. 
This is a nice reply by Aleksey Shipilev to my question on how to create 1.5GB heap size for Harmony.&lt;br /&gt;&lt;br /&gt;============================================&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;I will explain the trick "How to get DRL VM acquire &gt;1Gb of heap".&lt;br /&gt;&lt;br /&gt;The problem is predefined base address in some of the libraries: they are going to&lt;br /&gt;load at predefined location in system memory thus causing fragmentation of possible &lt;br /&gt;heap space. Since there are problems on allocating non-continuous heap, there's no &lt;br /&gt;possibility to allocate big chunk of memory. So we will need to relocate some &lt;br /&gt;libraries to another location.&lt;br /&gt;&lt;br /&gt;This time we have only by-hand solution, which could be machine-dependent. The idea &lt;br /&gt;is simple: try to allocate as much as we can and see what blocks us. I've used the &lt;br /&gt;simple test:&lt;br /&gt;&lt;br /&gt;  public class HeapTest {&lt;br /&gt;    public static void main(String args[]) throws Exception {&lt;br /&gt;      System.out.println("HeapTest started");&lt;br /&gt;      System.in.read();&lt;br /&gt;    }&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;Then I run this test with max of possible heap:&lt;br /&gt;&lt;br /&gt; $harmony-hdk-r532358/jdk/jre/bin/java -Xms900M -Xmx900M -XX:vm.dlls=gc_gen.dll &lt;br /&gt;-XX:gc.use_large_page=true HeapTest&lt;br /&gt;&lt;br /&gt;...and see the DLL distribution across the memory: you could use ProcessExplorer from &lt;br /&gt;SysInternals.com to obtain that list. 
I have this picture:&lt;br /&gt;&lt;br /&gt;  Name            Base            Size&lt;br /&gt;  unicode.nls     0x260000        0x16000&lt;br /&gt;  locale.nls      0x280000        0x34000&lt;br /&gt;  sortkey.nls     0x2C0000        0x41000&lt;br /&gt;  sorttbls.nls    0x310000        0x6000&lt;br /&gt;  ctype.nls       0x330000        0x3000&lt;br /&gt;  zlib1.dll       0x3A0000        0x13000&lt;br /&gt;  odbc32.dll      0x3C0000        0x3D000&lt;br /&gt;  java.exe        0x400000        0xD000&lt;br /&gt;  harmonyvm.dll   0x510000        0x424000&lt;br /&gt;  dbghelp.dll     0x940000        0xA8000&lt;br /&gt;  odbcint.dll     0x11E0000       0x17000&lt;br /&gt;  em.dll          0x1330000       0x40000&lt;br /&gt;  jitrino.dll     0x1380000       0x410000&lt;br /&gt;  gc_gen.dll      0x17A0000       0x2C000&lt;br /&gt;  hysig.dll       0x17E0000       0x6000&lt;br /&gt;  hytext.dll      0x17F0000       0x6000&lt;br /&gt;  hyzlib.dll      0x1E70000       0x13000&lt;br /&gt;  vmi.dll 0x1E90000       0x6000&lt;br /&gt;  hynio.dll       0x1EA0000       0x6000&lt;br /&gt;  hyluni.dll      0x1EB0000       0x23000&lt;br /&gt;  hyarchive.dll   0x1EE0000       0xD000&lt;br /&gt;  icuinterface34.dll      0x25F0000       0x17000&lt;br /&gt;  hythr.dll       0x10000000      0x407000&lt;br /&gt;  hyprt.dll       0x11100000      0x18000&lt;br /&gt;  [ ------------------ here goes the chunk ------------------ ]&lt;br /&gt;  icuuc34.dll     0x4A800000      0xC8000&lt;br /&gt;  icuin34.dll     0x4A900000      0xAA000&lt;br /&gt;  icudt34.dll     0x4AD00000      0x870000&lt;br /&gt;  [ --------------- and here goes the chunk -------------- ]&lt;br /&gt;  mswsock.dll     0x71B20000      0x41000&lt;br /&gt;  ws2help.dll     0x71BF0000      0x8000&lt;br /&gt;  ws2_32.dll      0x71C00000      0x17000&lt;br /&gt;  comdlg32.dll    0x762B0000      0x4A000&lt;br /&gt;  userenv.dll     0x76920000      0xC4000&lt;br /&gt;  psapi.dll       0x76B70000      0xB000&lt;br /&gt;  secur32.dll   
  0x76F50000      0x13000&lt;br /&gt;  user32.dll      0x77380000      0x92000&lt;br /&gt;  comctl32.dll    0x77420000      0x103000&lt;br /&gt;  comctl32.dll    0x77530000      0x97000&lt;br /&gt;  version.dll     0x77B90000      0x8000&lt;br /&gt;  msvcrt.dll      0x77BA0000      0x5A000&lt;br /&gt;  gdi32.dll       0x77C00000      0x48000&lt;br /&gt;  rpcrt4.dll      0x77C50000      0x9F000&lt;br /&gt;  shlwapi.dll     0x77DA0000      0x52000&lt;br /&gt;  kernel32.dll    0x77E40000      0x102000&lt;br /&gt;  advapi32.dll    0x77F50000      0x9C000&lt;br /&gt;  msvcr71.dll     0x7C340000      0x56000&lt;br /&gt;  ntdll.dll       0x7C800000      0xC0000&lt;br /&gt;  shell32.dll     0x7C8D0000      0x803000&lt;br /&gt;&lt;br /&gt;Let's try to merge these chunks together. We will use the editbin utility from MS &lt;br /&gt;Platform SDK. I checked that editbin is on my $PATH and then run the following &lt;br /&gt;script:&lt;br /&gt;&lt;br /&gt;  editbin /LARGEADDRESSAWARE java.exe&lt;br /&gt;&lt;br /&gt;  editbin /LARGEADDRESSAWARE /rebase:base=0x84000000 hythr.dll&lt;br /&gt;  editbin /LARGEADDRESSAWARE /rebase:base=0x84500000 hysig.dll&lt;br /&gt;  editbin /LARGEADDRESSAWARE /rebase:base=0x84550000 hyprt.dll&lt;br /&gt;  editbin /LARGEADDRESSAWARE /rebase:base=0x84600000 hyzlib.dll&lt;br /&gt;  editbin /LARGEADDRESSAWARE /rebase:base=0x84650000 hytext.dll&lt;br /&gt;  editbin /LARGEADDRESSAWARE /rebase:base=0x84700000 vmi.dll&lt;br /&gt;  editbin /LARGEADDRESSAWARE /rebase:base=0x84750000 hyluni.dll&lt;br /&gt;  editbin /LARGEADDRESSAWARE /rebase:base=0x84800000 hyarchive.dll&lt;br /&gt;  editbin /LARGEADDRESSAWARE /rebase:base=0x84850000 hynio.dll&lt;br /&gt;  editbin /LARGEADDRESSAWARE /rebase:base=0x84900000 hycharset.dll&lt;br /&gt;&lt;br /&gt;  editbin /LARGEADDRESSAWARE /rebase:base=0x85500000 gc_cc.dll&lt;br /&gt;  editbin /LARGEADDRESSAWARE /rebase:base=0x85500000 gc_gen.dll&lt;br /&gt;  editbin /LARGEADDRESSAWARE /rebase:base=0x85600000 harmonyvm.dll&lt;br 
/&gt;  editbin /LARGEADDRESSAWARE /rebase:base=0x86100000 zlib1.dll&lt;br /&gt;  editbin /LARGEADDRESSAWARE /rebase:base=0x86200000 em.dll&lt;br /&gt;  editbin /LARGEADDRESSAWARE /rebase:base=0x86300000 jitrino.dll&lt;br /&gt;  editbin /LARGEADDRESSAWARE /rebase:base=0x87000000 hysecurity.dll&lt;br /&gt;  editbin /LARGEADDRESSAWARE /REBASE:BASE=0x87030000 icuuc34.dll&lt;br /&gt;  editbin /LARGEADDRESSAWARE /REBASE:BASE=0x87100000 icudt34.dll&lt;br /&gt;  editbin /LARGEADDRESSAWARE /REBASE:BASE=0x87200000 icuin34.dll&lt;br /&gt;  editbin /LARGEADDRESSAWARE /REBASE:BASE=0x87300000 icuin34.dll&lt;br /&gt;  editbin /LARGEADDRESSAWARE /REBASE:BASE=0x87400000 icuin34.dll&lt;br /&gt;  editbin /LARGEADDRESSAWARE /REBASE:BASE=0x87500000 ICUInterface34.dll&lt;br /&gt;&lt;br /&gt;either on jre/bin and jre/bin/default directories.&lt;br /&gt;&lt;br /&gt;The idea is simple too: we are moving the libraries at the end of memory and base &lt;br /&gt;them there. Note that this script is really OS/build dependent since the initial &lt;br /&gt;distribution is unknown.&lt;br /&gt;&lt;br /&gt;After applying these transformations I was able to run the test again with larger &lt;br /&gt;heap:&lt;br /&gt;&lt;br /&gt;  $ harmony-hdk-r532358/jdk/jre/bin/java -Xms1700M -Xmx1700M -XX:vm.dlls=gc_gen.dll &lt;br /&gt;-XX:gc.use_large_page=true  HeapTest&lt;br /&gt;  &gt; GC use large pages.&lt;br /&gt;  &gt; HeapTest started&lt;br /&gt;&lt;br /&gt;Horray! 
Then I make sure that it worked out fine:&lt;br /&gt;&lt;br /&gt;  Name    Base    Size&lt;br /&gt;  unicode.nls     0x260000        0x16000&lt;br /&gt;  locale.nls      0x280000        0x34000&lt;br /&gt;  sortkey.nls     0x2C0000        0x41000&lt;br /&gt;  sorttbls.nls    0x310000        0x6000&lt;br /&gt;  ctype.nls       0x330000        0x3000&lt;br /&gt;  java.exe        0x400000        0xD000&lt;br /&gt;  odbcint.dll     0xCB0000        0x17000&lt;br /&gt;  [ ------------------ a BI-I-I-I-G chunk here --------------- ]&lt;br /&gt;  icuinterface34.dll      0x71AA0000      0x17000&lt;br /&gt;  mswsock.dll     0x71B20000      0x41000&lt;br /&gt;  ws2help.dll     0x71BF0000      0x8000&lt;br /&gt;  ws2_32.dll      0x71C00000      0x17000&lt;br /&gt;  icuin34.dll     0x72520000      0xAA000&lt;br /&gt;  comdlg32.dll    0x762B0000      0x4A000&lt;br /&gt;  userenv.dll     0x76920000      0xC4000&lt;br /&gt;  psapi.dll       0x76B70000      0xB000&lt;br /&gt;  secur32.dll     0x76F50000      0x13000&lt;br /&gt;  user32.dll      0x77380000      0x92000&lt;br /&gt;  comctl32.dll    0x77420000      0x103000&lt;br /&gt;  comctl32.dll    0x77530000      0x97000&lt;br /&gt;  version.dll     0x77B90000      0x8000&lt;br /&gt;  msvcrt.dll      0x77BA0000      0x5A000&lt;br /&gt;  gdi32.dll       0x77C00000      0x48000&lt;br /&gt;  rpcrt4.dll      0x77C50000      0x9F000&lt;br /&gt;  shlwapi.dll     0x77DA0000      0x52000&lt;br /&gt;  kernel32.dll    0x77E40000      0x102000&lt;br /&gt;  advapi32.dll    0x77F50000      0x9C000&lt;br /&gt;  msvcr71.dll     0x7C340000      0x56000&lt;br /&gt;  ntdll.dll       0x7C800000      0xC0000&lt;br /&gt;  shell32.dll     0x7C8D0000      0x803000&lt;br /&gt;  hythr.dll       0x84000000      0x407000&lt;br /&gt;  hysig.dll       0x84500000      0x6000&lt;br /&gt;  hyprt.dll       0x84550000      0x18000&lt;br /&gt;  hyzlib.dll      0x84600000      0x13000&lt;br /&gt;  hytext.dll      0x84650000      0x6000&lt;br /&gt;  vmi.dll 
0x84700000      0x6000&lt;br /&gt;  hyluni.dll      0x84750000      0x23000&lt;br /&gt;  hyarchive.dll   0x84800000      0xD000&lt;br /&gt;  hynio.dll       0x84850000      0x6000&lt;br /&gt;  gc_gen.dll      0x85500000      0x2C000&lt;br /&gt;  harmonyvm.dll   0x85600000      0x424000&lt;br /&gt;  zlib1.dll       0x86100000      0x13000&lt;br /&gt;  em.dll  0x86200000      0x40000&lt;br /&gt;  jitrino.dll     0x86300000      0x410000&lt;br /&gt;  odbc32.dll      0x86800000      0x3D000&lt;br /&gt;  dbghelp.dll     0x86900000      0xA8000&lt;br /&gt;  icuuc34.dll     0x87030000      0xC8000&lt;br /&gt;  icudt34.dll     0x87100000      0x870000&lt;br /&gt;&lt;br /&gt;Hopefully, my script will work OOB. If there any problems, we will try to reiterate &lt;br /&gt;the moving one more time on another addresses. Note that relocating of MS Windows &lt;br /&gt;system libraries is the challenge (and one could consider this unfair) - since &lt;br /&gt;Windows will try to fight you :)&lt;br /&gt;&lt;br /&gt;Thanks,&lt;br /&gt;Aleksey Shipilev&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;============================================&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-3181646928112936694?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/3181646928112936694/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=3181646928112936694' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/3181646928112936694'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/3181646928112936694'/><link rel='alternate' type='text/html' 
href='http://xiao-feng.blogspot.com/2007/05/how-to-create-1gb-heap-size-in-32-bit.html' title='How to create &gt;1GB heap size in 32-bit Windows?'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-4788321460739932259</id><published>2007-05-22T00:50:00.000-07:00</published><updated>2007-05-22T04:48:40.310-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='JVM'/><title type='text'>Force a full heap collection from outside of GC?</title><content type='html'>In the last blog entry [1], I discussed the problem of runtime memory leaks caused by native data structures. In this essay, I approach it from the other direction: how to reclaim the native data structures promptly so as to avoid a possible system crash.&lt;br /&gt;&lt;br /&gt;Take fat lock reclamation as an example again. We now know how to reclaim garbage fat locks during a collection. But sometimes, before a GC is started, the fat locks already consume too much memory, which causes a runtime malloc() failure. This is possible if the runtime leaves the major part of its space to the GC heap, so that the remaining space is not enough to accommodate all the runtime data, including the endless generations of fat locks. &lt;br /&gt;&lt;br /&gt;If all the fat locks are still alive at the moment of the malloc() failure, we have to face an out-of-memory condition, though not in the GC heap. We can blame the application programmer for using too many contended locks, or we can adjust the GC heap size to run the application. But in reality, this situation virtually never happens. 
If a malloc() fails, that probably means too many dead fat locks are kept in memory and should be reclaimed now. &lt;br /&gt;&lt;br /&gt;The problem is that the threading component can't simply trigger a collection, since a collection is usually triggered only by an unsatisfied allocation or by the Java application through System.gc(). The reason other components do not trigger a GC is that the context for a collection must be prepared beforehand; specifically, the stack should be unwindable and the context should be enumerable for the root set. This is not unsolvable, though. We can carefully design the code to make the context ready and then call into the GC.&lt;br /&gt;&lt;br /&gt;But there is another problem: a collection does not guarantee that all the dead objects are reclaimed. The GC usually decides by itself how to collect the heap, e.g., it may decide to do a minor collection. Without a full heap collection, the threading manager cannot ensure that the dead fat locks are cleaned up. So the threading manager wants to force the GC to do a full heap collection once malloc() fails.&lt;br /&gt;&lt;br /&gt;This is a new issue for the GC. It should probably provide such an interface to balance the competition for space between native resources and application data.&lt;br /&gt;&lt;br /&gt;This particular problem would probably go away if we created the fat locks in the GC heap. But other native resources have a similar problem. For example, file handles can run out if the application keeps creating objects that allocate new file handles, even though most of the file handles could actually be recycled because their owning Java objects are already dead; there has just been no collection to reclaim them.&lt;br /&gt;&lt;br /&gt;I believe this is a common issue in any resource management system. As long as the resources are partitioned, there can always be situations where some partitions run out while others are almost untouched. 
A high-level idea to solve the problem is not to partition the resources at all, so that all the consumers can use any part of the resources. This requires that the resources be managed equally and uniformly. For example, the heap is used by both native and managed data equally, and the file handles and the memory are managed by the same component. &lt;br /&gt;&lt;br /&gt;The other high-level idea is to dynamically adjust the boundaries between the partitions. This is actually a special case of the first high-level idea, i.e., one partition for each kind of resource, but one manager for all kinds of resources. A very interesting example is the space partitioning inside the GC heap, which is a miniature of the situation. In Harmony GCv5, we partition the GC heap into NOS/MOS/LOS spaces, but in order to fully utilize the space, we need to dynamically adjust their boundaries. Although each space has its own manager, the whole heap is managed by the top-level GC. &lt;br /&gt;&lt;br /&gt;Back to the original topic: forcing a full heap collection from the threading component can solve the fat lock issue, but we probably also want the application to be able to do the same thing. This requirement comes from DirectBuffer management.&lt;br /&gt;&lt;br /&gt;A DirectBuffer is a Java object that refers to a piece of native memory. The native memory can be so large that after only a couple of DirectBuffer creations the native space becomes too full to malloc any more. The Java code in this case may want to call into the GC to force a full heap collection. 
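&lt;br /&gt;&lt;br /&gt;As a rough illustration of the DirectBuffer case, here is a minimal Java sketch of what application code can do with the standard API. The class and method names (DirectBufferRetry, allocateDirectWithRetry) are made up for this example and are not part of Harmony. Note that System.gc() is only a request: the JVM may ignore it or run a partial collection, which is exactly why a real "force a full heap collection" interface is being asked for.&lt;br /&gt;&lt;pre&gt;
import java.nio.ByteBuffer;

public class DirectBufferRetry {
    // Hypothetical helper: try to allocate native memory, and on failure
    // ask the GC to collect so that dead DirectBuffers can release theirs.
    static ByteBuffer allocateDirectWithRetry(int size) {
        try {
            return ByteBuffer.allocateDirect(size);
        } catch (OutOfMemoryError e) {
            // Only a hint; a full heap collection is not guaranteed.
            System.gc();
            // Second and last attempt; a new OutOfMemoryError propagates.
            return ByteBuffer.allocateDirect(size);
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = allocateDirectWithRetry(1024);
        System.out.println(buf.capacity() + " bytes, direct: " + buf.isDirect());
    }
}
&lt;/pre&gt;A single retry is a deliberate simplification here; whether retrying helps at all depends on the collector actually reclaiming the dead buffers.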
&lt;br /&gt;&lt;br /&gt;If we view the runtime system as three parts: application, runtime, and GC, now we understand why the memory issue requires GC to expose its collection API.&lt;br /&gt;&lt;br /&gt;[1]http://xiao-feng.blogspot.com/2007/05/native-resource-management-in-jvm.html&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-4788321460739932259?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/4788321460739932259/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=4788321460739932259' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/4788321460739932259'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/4788321460739932259'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/05/force-full-heap-collection-from-outside.html' title='Force a full heap collection from outside of GC?'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-8501824091993885335</id><published>2007-05-21T23:20:00.000-07:00</published><updated>2007-05-22T04:48:40.310-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='JVM'/><title type='text'>Native resource management in JVM</title><content type='html'>Native resources 
here mainly refer to those system resources that are not directly managed by the JVM, such as file handles, sockets, etc. One of the important native resources in the runtime is native memory. This sounds contradictory to the JVM's design spirit that the heap is managed automatically by the runtime, but they are actually different issues. &lt;br /&gt;&lt;br /&gt;In a JVM, only the object heap is specified and managed automatically. The heap is basically used for application data, not necessarily for the data structures of the runtime itself. The issues with native memory can be classified into three categories:&lt;ul&gt;&lt;br /&gt;&lt;li&gt; The data structures that are created and used by the runtime, such as the class table for loaded classes, the string buffer for interned strings, the code buffer for jitted code, etc. They are purely runtime data structures and can be managed by the runtime in the usual way, as in common C/C++ applications.&lt;br /&gt;&lt;li&gt; The data structures that are created for Java objects. Their lifetimes are tied to those of the Java objects, but they are not defined or required by the application. An example is the fat lock data structure in the &lt;i&gt;thin-lock&lt;/i&gt; implementation of the Java monitor. &lt;br /&gt;&lt;li&gt; The data structures that are requested or specified by the application, but are out of the JVM's control, such as the system resources above. DirectBuffer in Java also belongs to this category. A DirectBuffer is a piece of native memory, defined in Java for I/O efficiency, so that the application can use it for system operations like file mapping.&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;All the native resources mentioned above are memory. There are a couple of alternative solutions to manage them. In this essay, I will discuss the techniques for the second category; specifically, I will use the fat lock as an example.&lt;br /&gt;&lt;br /&gt;A fat lock is a data structure that is created when a Java object is contended as a lock. The creation is called "lock inflation" in thin-lock. 
A fat lock can be accessed from its Java object, and its memory is freed only when the Java object is dead or the fat lock is deflated. In a JVM that doesn't reclaim the fat locks of dead Java objects, the leaked memory may eventually accumulate into a serious problem.&lt;br /&gt;&lt;br /&gt;To reclaim the dead fat locks, the key is to know they are dead. Since fat locks are managed by the threading component, it needs the GC to inform it about the liveness of the fat locks. &lt;ul&gt;&lt;br /&gt;&lt;li&gt;A straightforward solution is for the GC to mark the live fat locks during object scanning. This can be a performance issue, since object scanning needs to check whether each scanned object has a fat lock associated. &lt;br /&gt;&lt;li&gt;Another solution is that, after the GC marking phase, when all the live objects are known, the threading component goes through all of its fat locks and checks whether any of them are dead because their Java objects are not marked. This solution doesn't have the problem of the previous one, but it requires that the threading component can access the Java objects from their fat locks. This means each fat lock must hold some reference to its Java object. The drawback is that the GC has to update those references in the fat locks if their Java objects are moved during a collection. &lt;br /&gt;&lt;li&gt;A third solution would be to maintain weak references from those fat locks to their Java objects. During GC marking, the weak references are not followed, so a fat lock can't keep its Java object alive. After GC marking, the reference queue is walked to find the dead Java object referents. This approach has the same disadvantage as the second one, and the additional drawback that the GC has to maintain the reference queue.&lt;/ul&gt;&lt;br /&gt;The discussion above does not cover all the issues with native resource management. 
More to come.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-8501824091993885335?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/8501824091993885335/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=8501824091993885335' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/8501824091993885335'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/8501824091993885335'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/05/native-resource-management-in-jvm.html' title='Native resource management in JVM'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-6272664916190053139</id><published>2007-05-03T23:50:00.000-07:00</published><updated>2007-05-04T00:33:49.379-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='Apache Harmony'/><title type='text'>Finalization subsystem design in Apache Harmony</title><content type='html'>The processing of finalizers and the processing of weak references are similar and closely related. Here the finalizer subsystem also includes the weak reference support in the GC, so I use the names finref subsystem and finalizer subsystem interchangeably. 
&lt;br /&gt;&lt;br /&gt;Finalizer processing includes the following major activities: &lt;ol&gt;&lt;br /&gt;&lt;li&gt;&lt;b&gt;Remember all the objects that have a finalizer&lt;/b&gt;. This is easy to do at allocation time. I know some GC implementations find those objects at collection time, which requires scanning dead objects and hence is undesirable in my opinion. Doing it at allocation time lengthens the allocation path a little. But since we need certain checks in the fast-path allocation anyway, it does not hurt to add one more check for the finalizer. &lt;br /&gt;&lt;br /&gt;&lt;li&gt;&lt;b&gt;Identify the finalizable objects&lt;/b&gt;. Once the GC marking phase is finished and all live objects are marked, the collector goes through the queue of remembered objects with finalizers, checking whether any of them are unreachable. The unreachable objects in the queue are the finalizable objects, and they are passed to the VM for finalization. Before they are handed over to the VM, the collector traces through those objects to resurrect all the objects transitively referenced from them. &lt;br /&gt;&lt;br /&gt;&lt;li&gt;&lt;b&gt;Finalization of the finalizable objects&lt;/b&gt;. The VM has a couple of finalizing threads sleeping while they wait for new finalization tasks. When the GC passes new finalizable objects to the VM, the finalizing threads are woken up and start to invoke the finalize() method of those objects. These finalizing threads are native threads associated with Java thread objects, because they act as Java threads when executing finalizers.&lt;/ol&gt; &lt;br /&gt;&lt;br /&gt;&lt;b&gt;Finalization load balance&lt;/b&gt; &lt;br /&gt;&lt;br /&gt;One tricky scenario needs special handling. If a mutator keeps producing objects with finalizers, and the finalizers cannot be executed in time, the heap space will be consumed by dead objects waiting to be finalized. The application will then run into an out-of-memory exception. 
&lt;br /&gt;&lt;br /&gt;There are two solutions in Harmony for this situation. One is to create more finalizing threads to compete with the mutators for processor resources, hopefully executing finalizers faster than the mutators generate them. The other is to block the guilty mutators until the queue of finalizable objects is shortened by the finalizing threads. GCv4.1 adopts the first solution, while GCv5 adopts the second.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Synchronization between finalizing thread and mutator&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;A finalizer may have synchronized blocks (or methods), which sometimes compete with mutators for locks. One scenario is that a mutator generates objects with finalizers only inside a synchronized block, while the finalizer needs to hold the same lock to execute. This is a contention: if the mutator holds the lock, it will generate objects continuously until the heap runs out; if the finalizer holds the lock, the mutator gets no chance to make progress. The contention can be relieved by the load balance mechanism.&lt;br /&gt;&lt;br /&gt;But the trickier issue is that, due to the load balance mechanism, the mutator may be blocked inside the synchronized block (i.e., while holding the lock). Since the lock is held by the mutator, the finalizer cannot proceed either. This is a sort of deadlock. The solution is to block the mutator with a timer. Once the timer expires, the mutator resumes its execution regardless of the finalizable queue length. &lt;br /&gt;&lt;br /&gt;&lt;b&gt;Comparison of the load balance mechanisms&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;It is hard to say which mechanism is absolutely better: to start more finalizing threads or to block the mutator.&lt;br /&gt;&lt;br /&gt;Starting more finalizing threads assumes that the OS will schedule the threads fairly, so that more threads means more time slices. This assumption may not always be true. 
Without control over the rate of finalizable object creation, more finalizing threads cannot guarantee the prompt execution of the finalizers. The heap can run out sooner, and more importantly the resources that can only be released by finalizers will not become available any sooner. &lt;br /&gt;&lt;br /&gt;On the other hand, blocking the mutator from generating objects with finalizers has the synchronization issue. In this mechanism, the system hopes that the finalizing threads can execute finalizers efficiently while the mutator is blocked; but this hope cannot be satisfied if the mutator is blocked with the lock held, and there is no other controlled chance for the finalizing thread to execute. One solution for this issue is to increase the chances. For example, besides the allocation site, the guilty mutator can also be blocked when it is about to acquire a lock. &lt;br /&gt;&lt;br /&gt;Harmony GCv5 chooses the mutator blocking mechanism for two reasons:&lt;br /&gt;1. In common applications, it works very well, with clear semantics and a controlled (fixed) number of finalizing threads;&lt;br /&gt;2. Even in the extreme synchronization scenario, it still works, although slowly. &lt;br /&gt;&lt;br /&gt;This reflects our design philosophy: be simple for the common case, and still work for the unusual case. 
Probably the best mechanism is to have a solution that can combine the two.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-6272664916190053139?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/6272664916190053139/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=6272664916190053139' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/6272664916190053139'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/6272664916190053139'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/05/finalization-subsystem-design-in-apache.html' title='Finalization subsystem design in Apache Harmony'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-1142054557207210653</id><published>2007-05-03T23:38:00.001-07:00</published><updated>2007-05-04T00:33:49.379-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='Apache Harmony'/><title type='text'>Harmony GCv5 64-bit Support</title><content type='html'>Harmony GCv5 was originally designed with 32-bit architecture in mind. In first quarter of 2007, it was enhanced to "support" 64-bit platforms. 
I use the quotation marks because the support is only available in a special form, i.e., it only works in compressed reference mode. &lt;br /&gt;&lt;br /&gt;&lt;b&gt;Compressed reference&lt;/b&gt; &lt;br /&gt;&lt;br /&gt;On the 64-bit machines currently available, people's applications usually run with a limited heap size, smaller than 4GB. That means, although the platform offers a much larger potential heap space, we use only a portion of it, which can be covered by a 32-bit address range. This observation inspired the idea of compressed references, where the runtime uses a 32-bit compressed address to represent an object reference. The real address is the sum of the 32-bit address value and a heap base address value, so the compressed value of the heap base address itself is zero. In order to distinguish this zero value from the NULL reference, we simply avoid the zero value by setting the heap base address a few bytes lower than the real heap start address. &lt;br /&gt;&lt;br /&gt;We encode all the reference fields in 32-bit compressed mode, and we also use 32 bits to encode the vtable field in the object header. Since the obj_info field is kept at 32 bits on both platforms, the total object header overhead remains two 32-bit words (or one 64-bit word). &lt;br /&gt;&lt;br /&gt;&lt;b&gt;Object reference&lt;/b&gt; &lt;br /&gt;&lt;br /&gt;The "compressed reference" is only one form of object reference representation. There is no requirement in the JVM specification on the reference representation. Whether it is 32-bit, 64-bit, or something else is completely a JVM-internal design issue. It is even possible to have hybrid reference representations. The only deciding factor is the cost-efficiency in both space and time. &lt;br /&gt;&lt;br /&gt;With this in mind, GCv5 defines a REF type for object references. The GC has no idea about the layout (or physical meaning) of a REF value, except that it is an object reference. 
Whenever the collector accesses a reference, it calls ref_to_obj_ptr() to convert the REF value to a real address pointer. Conversely, the collector calls obj_ptr_to_ref() to encode an address into a reference. The actual encoding rule is decided by the implementation of this function. On a 32-bit platform, this function can simply return the same value untouched. On a 64-bit platform, it can compress the pointer. &lt;br /&gt;&lt;br /&gt;&lt;b&gt;Heap space size&lt;/b&gt;&lt;br /&gt;Currently REF is defined as a 32-bit value type. This is not necessarily the only option. And this choice doesn't necessarily mean Harmony can only support heap sizes smaller than 4GB on a 64-bit platform.&lt;br /&gt;&lt;br /&gt;There are a couple of ideas for using 32-bit object references with heaps bigger than 4GB. One technique is to have a few 32-bit spaces, each of which needs only 32-bit object references. We can use the few least significant bits of a reference to index the space for its heap base address, since object alignment at 4 or 8 bytes leaves the 2 or 3 least significant bits always zero.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-1142054557207210653?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/1142054557207210653/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=1142054557207210653' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/1142054557207210653'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/1142054557207210653'/><link rel='alternate' type='text/html' 
href='http://xiao-feng.blogspot.com/2007/05/harmony-gcv5-64-bit-support_03.html' title='Harmony GCv5 64-bit Support'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-7671927647094673653</id><published>2007-04-23T05:03:00.001-07:00</published><updated>2007-04-23T05:16:31.681-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><title type='text'>Memory problems in J2EE applications?</title><content type='html'>I saw following notes from Steven Haines' weblog entry on May 18, 2005:&lt;br /&gt;&lt;br /&gt;=========================&lt;br /&gt;&lt;i&gt;&lt;br /&gt;As I travel around the country tuning J2EE environments, by far the number one problem is memory. Application memory problems come in two flavors: &lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Lingering Objects/References: objects that pass the reachability test and are valid in the JVM, but are not "live", meaning that the application is holding the reference without the intent of using the object again in the future. This is what we refer to as a Java Memory Leak&lt;br /&gt;&lt;li&gt;Object Cycling: rapidly creating and destroying objects. This can be an error or be forced to try to better control garbage collection: destroy objects so that they will be quickly reclaimed&lt;/ul&gt;&lt;br /&gt;While these are application or programming errors, there is much that a JVM could do to detect these and help developers identify them. The Sun JVM provides some APIs, such as JVMPI, to provide developers the ability to learn more about the heap, but the problem is the overhead in using these features. 
These features and not suitable for running in a production environment. A more robust approach that monitoring companies have been trying to subvert present certain dangers (such as crashing the entire JVM) that if built natively into the JVM could mitigate these risks. From a monitoring and management perspective, a successful and accepted open source standard could mean an evolutionary step in enhancing the stability of high volume enterprise applications.&lt;br /&gt;&lt;br /&gt;While there are true memory leaks in J2EE applications that cause frequent crashes, many times the problem is more about the proper configuration of the heap to the behavior of the applications running within the heap. During most of the J2EE tuning engagements that I deliver I can identify architectural flaws built into applications, but most problems can be resolved or at least reduced by tuning the heap. This tuning process requires a deep knowledge of the JVM's object lifecycle management and garbage collection strategies and is unnecessarily complex. A deep analysis of garbage collection algorithms and an adaptive strategy could help the JVM perform ideally for the applications running in the heap. It is a tough problem, but if the project is open source, it would allow deeper research and a quicker evolution of revolutionary changes to the way the JVMs run today.&lt;br /&gt;&lt;/i&gt;&lt;br /&gt;============================&lt;br /&gt;&lt;br /&gt;I am not familiar with J2EE applications' behavior, but the quotes above concur with my limited experience and impression with J2EE workloads. It is good to have the problems clearly stated. 
Probably Harmony GCv5 can spend some efforts in them.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-7671927647094673653?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/7671927647094673653/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=7671927647094673653' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/7671927647094673653'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/7671927647094673653'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/04/memory-problems-in-j2ee-applications_23.html' title='Memory problems in J2EE applications?'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-6178781821911106231</id><published>2007-04-23T03:16:00.000-07:00</published><updated>2007-04-23T05:17:08.170-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='Apache Harmony'/><title type='text'>Harmony GCv5 is turned on as default GC</title><content type='html'>Harmony DRLVM architecture has a good modular GC component design. GC can be built as a shared object (or dynamic linked library) and plugged into the VM, as long as the GC is written by following the defined interface between VM and GC. 
Different GC implementations can be selected with a command line option. By default, one GC implementation is selected in Harmony. &lt;br /&gt;&lt;br /&gt;Here is a &lt;a href="http://harmony.apache.org/subcomponents/drlvm/gc-howto.html"&gt;GC Developer's Guide&lt;/a&gt; giving step-by-step instructions on writing a simple GC from scratch.&lt;br /&gt;&lt;br /&gt;Currently there are three independent garbage collection modules implemented in Harmony DRLVM: GCv4, GCv4.1, and GCv5. All of them are stop-the-world garbage collectors.&lt;ul&gt;&lt;br /&gt;&lt;li&gt;GCv4 is legacy and no longer maintained. It is a mark-compact collector based on the LISP2 compaction algorithm. GCv4 is a sequential non-generational collector.&lt;br /&gt;&lt;li&gt;GCv4.1 is a copying collector with a compaction fall-back. The compactor is based on the threaded reference algorithm. GCv4.1 is a sequential non-generational collector.&lt;br /&gt;&lt;li&gt;GCv5 is a fully parallel GC, which can work in both generational and non-generational modes. GCv5 achieves rather good scalability on parallel machines, and has dynamic runtime adaptations for best throughput.&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;Two days ago, I made a change in the Harmony source code to make GCv5 the default GC. Since GCv5 is a new GC, developed in a relatively short time, the transition period may cause some test regressions. The change of the default GC to GCv5 is only a trial. We would switch back if GCv5 causes serious regressions. 
&lt;br /&gt;&lt;br /&gt;So far, the good news is I haven't seen new failure reports caused by GCv5 that were not existent with GCv4.1, the previous default GC.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-6178781821911106231?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/6178781821911106231/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=6178781821911106231' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/6178781821911106231'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/6178781821911106231'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/04/harmony-gcv5-is-turned-on-as-default-gc.html' title='Harmony GCv5 is turned on as default GC'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-94196367112492407</id><published>2007-04-23T03:05:00.000-07:00</published><updated>2007-04-23T05:17:08.171-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='Apache Harmony'/><title type='text'>Harmony GCv5 design overview</title><content type='html'>There is no document available yet for Harmony GCv5 design. I quickly wrote up the following words to highlight some GCv5 features. 
It is also available on the Harmony wiki as &lt;a href="http://wiki.apache.org/harmony/MemoryManager"&gt;Harmony Memory Manager&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Spaces&lt;/b&gt;&lt;br /&gt;GCv5 partitions the heap space into NOS (nursery object space), MOS (mature object space), and LOS (large object space). The boundaries between them are adjusted dynamically and automatically by the GC according to the space utilization. Normal objects are allocated only in NOS, or in LOS if their size is bigger than a threshold. MOS is used for the survivors of NOS.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Collections&lt;/b&gt;&lt;br /&gt;There are basically two modes of collection: minor and major. A minor collection copies live objects from NOS to MOS. A major collection compacts NOS and MOS, and sweeps LOS. There are other modes of collection for special situations. When the MOS is inadequate to accommodate the NOS survivors during a minor collection, the collection transitions into a major collection. This is called a fallback collection. When the GC discovers that LOS and MOS are not equally utilized, it triggers an extension collection, which extends either LOS or MOS and shrinks the other one.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Collectors&lt;/b&gt;&lt;br /&gt;By default GCv5 uses a depth-first copying algorithm in minor collection. It also supports breadth-first copying and allocation-order copying. Major collection in GCv5 has two implementations: one is the classic LISP2 compactor, the other is a 2-pass compactor. Both of them are fully parallelized while preserving the sliding-compaction property. &lt;br /&gt;&lt;br /&gt;&lt;b&gt;Runtime adaptation: major or minor&lt;/b&gt;&lt;br /&gt;Since a minor collection usually has a much shorter pause time than a major collection, we want most collections to be minor. On the other hand, a major collection can usually free more space, which is important for the minor collections to perform well. 
GCv5 adapts automatically, switching between minor and major collections for the best throughput.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Runtime adaptation: gen or non-gen&lt;/b&gt;&lt;br /&gt;GCv5 can work in generational mode, where minor collection uses remembered set information, and in non-generational mode, where minor collection needs to trace the entire heap to mark live objects. Generational mode has the advantage when traversing the entire heap is too time-consuming, while its downside is the write barrier overhead. GCv5 can switch dynamically between gen and non-gen mode. This adaptation is turned off by default, since its performance depends on the workload's behavior.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Parallel load balance&lt;/b&gt;&lt;br /&gt;GCv5 has tried several load-balancing mechanisms. Currently only pool-sharing is kept, as the default. The other two candidates that might be applied in the future are work-stealing and task-pushing. Task-pushing uses the idea of Communicating Sequential Processes (CSP) for parallel task assignment among the collectors. Pool-sharing is somewhat similar to the work-packet mechanism, but pool-sharing works in depth-first order, which is believed to give better access locality.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Abstraction: threads and allocator &lt;/b&gt;&lt;br /&gt;In terms of design, GCv5 abstracts the collector and the mutator as subclasses of allocator, since the two play the same role: the mutator allocates in NOS and the collector allocates in MOS.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Abstraction: collection space and GC&lt;/b&gt;&lt;br /&gt;GCv5 also has an abstraction for spaces. In GCv5, a space and a collection algorithm are tied together. E.g., when we say Fspace, we mean the space that is managed by the copying algorithm. In this way, a GC is just a combination of multiple spaces, or equivalently a coordinator of multiple collection algorithms over the GC heap. 
It decouples the collection algorithm from GC construction, hence easing the construction of a new GC based on the existing collection algorithms.&lt;br /&gt;&lt;br /&gt;Below is a presentation on Harmony GCv5 design and status:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Xiao-Feng Li, &lt;a href="http://people.apache.org/~xli/docs/harmony_gcv5_overview.pdf"&gt;Harmony GCv5 Overview&lt;/a&gt;, April 22, 2007.&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-94196367112492407?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/94196367112492407/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=94196367112492407' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/94196367112492407'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/94196367112492407'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/04/harmony-gcv5-design-overview.html' title='Harmony GCv5 design overview'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-3575189923982557875</id><published>2007-04-16T03:29:00.000-07:00</published><updated>2007-04-23T05:16:31.682-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><title type='text'>Throw OutOfMemory exception during 
collection</title><content type='html'>Some people are confused about the semantics of the OutOfMemory (OOM) exception. For example, when developing a moving GC, they may think the GC should throw OOM if the target to-space is not large enough to hold the live objects.&lt;br /&gt;&lt;br /&gt;This is actually an incorrect understanding of OOM. From the Java programmer's point of view, OOM is thrown only when there is not enough memory available for a new object allocation. It should have nothing to do with the GC process. In other words, the GC should not throw OOM. &lt;br /&gt;&lt;br /&gt;This has an important implication: GC should always succeed. That is, a collection must be able to finish correctly, meaning the heap state must be kept consistent after the collection. &lt;br /&gt;&lt;br /&gt;If a copying GC can't move all the live objects to the target space, it only means the GC algorithm is problematic. The reason is simple: all those live objects already fit in the heap before the collection, so there is no reason for the heap to become insufficient after the collection. An OOM can be thrown when there is still inadequate free space for an object allocation after a collection. &lt;br /&gt;&lt;br /&gt;This concept is important because, when an OOM is thrown, the heap data should be consistent so that the Java program can continue executing, e.g., catch the OOM exception and set some references to null. Then subsequent allocations may be satisfied after the next collection and the program can proceed.&lt;br /&gt;&lt;br /&gt;With the above being said, sometimes the size of a live object can increase during collection, e.g., a "hashed" object as discussed in my previous blog entry [1]. There are other cases as well, in dynamically typed languages. 
The increased object sizes might cause difficulties in some traditionally-good GC algorithms.&lt;br /&gt;&lt;br /&gt;[1] http://xiao-feng.blogspot.com/2007/04/object-hashcode-implementation.html&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-3575189923982557875?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/3575189923982557875/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=3575189923982557875' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/3575189923982557875'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/3575189923982557875'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/04/throw-outofmemory-exception-during.html' title='Throw OutOfMemory exception during collection'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-4116366209280910762</id><published>2007-04-16T02:26:00.001-07:00</published><updated>2007-05-03T23:31:49.058-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='JVM'/><title type='text'>Object hashcode implementation</title><content type='html'>Java specification gives each object a hashcode, whose value is constant during the object lifetime. 
We've tried the following implementations of hashcode in Harmony.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;1. Hashcode is encoded in object header.&lt;/b&gt;&lt;br /&gt;Define an object header of two words on 32-bit platforms. (On 64-bit platforms, it can be longer.) One word is used for the vtable. The other is used for object-specific information such as locking, hashcode, and GC-related bits. We call this word obj_info in Apache Harmony. With a careful layout of obj_info, we can leave 8 bits for the hashcode, which can represent 256 unique hash values.&lt;br /&gt;&lt;br /&gt;The value range is small but usually good enough, because 1) most Java applications do not depend on a large hash value range for performance; 2) if certain Java applications really need a large value range, we assume they usually have their own hashing mechanism and data structures that do not use the JVM hashcode implementation.&lt;br /&gt;&lt;br /&gt;This simple implementation works in most cases. But some applications do depend on a JVM hashcode implementation with a large value range.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;2. Object address for hashcode.&lt;/b&gt;&lt;br /&gt;Use the object address as the hashcode. It is certainly unique if the GC is non-moving. With a moving GC, it needs a mechanism to support object movement. Bacon [1] proposed a solution that uses bits in obj_info to indicate the hashing state: unhashed, hashed, or moved. In the first two states, the object address is used for the hashcode; in the third, after the object has been moved, its original address is stored in an extra field added to the object.&lt;br /&gt;&lt;br /&gt;This method can provide 32-bit hash values. The range is large enough. But there is an issue. When an object is moved and an additional field is added, the object size increases. In the extreme case where all the objects are hashed and then moved, the target space might be inadequate to accommodate all of them. 
For example, in semi-space copying, assuming all the objects are live and hashed, the to-space will not be large enough to hold all of them with the additional fields. Hopefully this will never happen in reality, but a solution should be designed just in case.&lt;br /&gt;&lt;br /&gt;We found a solution for this case. For semi-space copying, when the hashcodes make the space inadequate, the collection falls back to the LISP2 compaction algorithm. For the compaction algorithm, it is even easier. People may think in-place compaction has no extra space for the additional hashcodes if all objects are live; but actually, if an object cannot move to a new location for lack of free space, it can stay unmoved, so that its hashing status is unchanged and it continues to use its address as the hashcode.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;3. Hash table for hashcode.&lt;/b&gt;&lt;br /&gt;Keep the object hashcodes in a hash table. The hash table is indexed by object address. After every moving collection, the hash table needs rehashing to maintain the key-value mapping. If the hash table is small, the rehashing process can be cheap.&lt;br /&gt;&lt;br /&gt;We developed an easy way to make rehashing fast. We don't use one big hash table for all the objects' hashcodes; instead, a hash list is maintained for each block. The rehashing process then rebuilds the hash list for each block. If the block size is small (e.g., 32KB) and the number of hashcodes is small, the rehashing can be distributed and done efficiently during object moving. To retrieve an object's hashcode, the GC can simply find the block the object resides in, get the hash list of that block, then use hashing or binary search to locate the hashcode quickly. It normally takes only a couple of memory operations. &lt;br /&gt;&lt;br /&gt;The overhead of this solution is the extra space for the hash lists. But that is also its advantage: it does not need to extend the object with a hashcode field. 
This is important for some GC algorithms that assume the object size never changes in its lifetime.&lt;br /&gt;&lt;br /&gt;All these have been implemented in Apache Harmony. Actually the three solutions can be combined to make a solution that works best with GC algorithms and applications.&lt;br /&gt;&lt;br /&gt;[1] BACON, D. F., KONURU, R., MURTHY, C., AND SERRANO, M. Thin locks: Featherweight synchronization for Java. PLDI'98.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-4116366209280910762?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/4116366209280910762/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=4116366209280910762' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/4116366209280910762'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/4116366209280910762'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/04/object-hashcode-implementation.html' title='Object hashcode implementation'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-573829392657496990</id><published>2007-04-10T07:26:00.000-07:00</published><updated>2007-06-16T22:41:48.608-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Programming'/><category scheme='http://www.blogger.com/atom/ns#' 
term='Multi-core'/><title type='text'>Can "transactional memory " be a new programming model?</title><content type='html'>Transactional memory has become popular in recent years. &lt;br /&gt;&lt;br /&gt;If I remember correctly, Maurice Herlihy and J. Eliot B. Moss's paper [1] was the first to introduce this concept. But it was largely ignored for a long time until Ravi Rajwar and James R. Goodman's paper on Speculative Lock Elision [2]. Then the next paper [3] by the same authors kicked off a new round of active research on transactional memory.&lt;br /&gt;&lt;br /&gt;Soon after that, people started to design new programming models for transactional memory in order to fully leverage its benefits. The main idea is to introduce the concept of a transaction at the programming level, so that the programmer can declare and control the execution of transactional code regions. One simplified model in Java programming is to treat all synchronized blocks/methods as transactions.&lt;br /&gt;&lt;br /&gt;The idea of transactional programming is to decouple the pursuit of performance from programming productivity, claiming to achieve both simultaneously. The programmer doesn't need to care about locking, and the execution of a critical section fails only when there are real runtime data conflicts. &lt;br /&gt;&lt;br /&gt;The main target of transactional programming is future multi-core or many-core platforms. But I personally have real doubts about this model. I don't think fine-grained transaction control should be exposed to common programmers. The reason is that transactional programming exposes non-intuitive code sequence execution semantics, which programmers would never like to see.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;1. Low-level semantics should not be exposed (normally)&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;My objection to fine-grained transactional programming is based on my experience with memory consistency models. 
A weakly ordered memory model exposes lots of optimization opportunities to programs, but as a programmer, I never want to reason about my program under weak ordering; on the contrary, I always reason about my program under the sequential consistency model, except for very tricky high-performance concurrent code, such as the work-stealing algorithm I used in the GC marking phase. In this sense, transactional programming should only be used when it is indeed necessary, because it is too low level. &lt;br /&gt;&lt;br /&gt;Transactions at a higher level have no problem. People have used them for decades in database management. The purpose of database transactions is quite different from that of transactional memory programming. They are for correctness plus performance, with the guarantee of the ACID properties. Programmability is largely not a consideration; it is addressed by SQL instead. Now transactional memory is going in the opposite direction. It was proposed in Speculative Lock Elision as a performance technique that is transparent to programmers, while the current trend is to achieve it through explicit programming. &lt;br /&gt;&lt;br /&gt;I do not believe one can program main memory like a database with transactions.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;2. Single-thread execution semantics should stay the same&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The other reason I don't think transactional memory can be a successful programming model is that it can't always keep the semantics of a single-threaded application. When programming for a weakly ordered memory model, I don't need to worry about single-thread execution: it always runs with the same "appearance" as under the sequential consistency model. In other words, the same code should always deliver the same result if executed in a single thread, no matter what memory model is used, and no matter whether there are memory fence/barrier instructions. Those instructions can be totally removed without impacting the single-thread execution in any case. 
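&lt;br /&gt;&lt;br /&gt;As a minimal sketch of the point about fences (the class and method names are made up for illustration, and it assumes Java 9 or later, where &lt;code&gt;VarHandle.fullFence()&lt;/code&gt; is available), the snippet below runs the same accumulation once with an explicit fence after every store and once without. In a single thread, both versions should compute the same result:&lt;pre&gt;

```java
import java.lang.invoke.VarHandle;

public class FenceDemo {
    // Hypothetical helper: accumulates into a heap cell and issues a
    // full memory fence after every store.
    static long sumFenced(int[] data) {
        long[] cell = new long[1];
        for (int value : data) {
            cell[0] += value;
            VarHandle.fullFence(); // only constrains ordering seen by other threads
        }
        return cell[0];
    }

    // The same loop with the fences removed.
    static long sumPlain(int[] data) {
        long[] cell = new long[1];
        for (int value : data) {
            cell[0] += value;
        }
        return cell[0];
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        // In a single thread the fences do not change the result.
        System.out.println(sumFenced(data) == sumPlain(data)); // prints "true"
    }
}
```

&lt;/pre&gt;The fences only constrain what other threads may observe, so removing them leaves the single-thread result unchanged.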
&lt;br /&gt;&lt;br /&gt;But this may not always be the case with transactional memory programming; take, for example, the semantics of exceptions thrown in a transaction. In single-thread execution, the transaction control instructions cannot always be removed, since some transactional programming models may require aborting the partial transaction result before the exception is thrown. This is really not intuitive. Common programmers would never want to think that way.&lt;br /&gt;&lt;br /&gt;I have more comments on the transactional programming model that I will discuss later.&lt;br /&gt;&lt;br /&gt;[1] Maurice Herlihy and J. Eliot B. Moss, Transactional memory: architectural support for lock-free data structures, ISCA'93.&lt;br /&gt;[2] Ravi Rajwar and James R. Goodman, Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution, MICRO'01.&lt;br /&gt;[3] Ravi Rajwar and James R. Goodman, Transactional Lock-Free Execution of Lock-Based Programs, ASPLOS'02.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-573829392657496990?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/573829392657496990/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=573829392657496990' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/573829392657496990'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/573829392657496990'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/04/can-transactional-memory-be-new.html' title='Can &quot;transactional memory &quot; be a new programming model?'/><author><name>Xiao-Feng 
Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-1551337975897702061</id><published>2007-04-03T06:22:00.001-07:00</published><updated>2007-04-23T05:16:31.684-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='JVM'/><title type='text'>Sequential in-place compacting garbage collectors</title><content type='html'>In-place compacting GC is attractive for its ability to defragment the heap space on-site: it can squeeze the free areas out of a used space. The defragmentation helps in several ways:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;It produces a large contiguous free space, so that allocation can be done by pointer bumping, and large objects can be accommodated.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;The live objects are compacted together, thus improving access locality.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Almost no space is wasted. Compaction supports the largest working set of all available collection algorithms.&lt;/li&gt;&lt;br /&gt;&lt;/ol&gt;&lt;br /&gt;Semi-space copying GC also compacts the live objects, but not in-place. The live objects are moved into another area. Compacting GC in the common sense refers only to collectors that can compact in-place.&lt;br /&gt;&lt;br /&gt;The main issue to solve in compacting GC is how to keep the forwarding pointer of an object. The forwarding pointer is the new location of the object. Since an object can be referenced by any other object in the heap, its forwarding pointer has to be stored somewhere so that other objects can update their reference fields accordingly. 
I will discuss three different compactors, focusing on how they keep the forwarding pointer.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;1. LISP2 compactor&lt;/b&gt; &lt;br /&gt;The new location of an object can be computed in one pass through the heap. If the value is kept in the object header, the object cannot be moved until all the references are updated. This leads to the LISP2 compactor [1], which employs four phases (or heap passes).&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Phase 0. marking live objects; (not counted as a pass of compaction)&lt;br /&gt;&lt;li&gt;Phase 1. computing object target location (and installing forwarding pointer in object header);&lt;br /&gt;&lt;li&gt;Phase 2. repointing all the references to their new locations;&lt;br /&gt;&lt;li&gt;Phase 3. slide-compacting live objects to one end of the heap;&lt;br /&gt;&lt;li&gt;Phase 4. restoring the object header word overwritten by the forwarding pointer.&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;b&gt;2. Chained reference compactor&lt;/b&gt;&lt;br /&gt;The second well-known compactor is the threaded pointer algorithm of Jonkers and Morris [2][3]. The key innovation in this algorithm is to eliminate the extra space requirement for forwarding pointers. It doesn't actually save the new location of an object; instead, it updates the references to an object when the object is processed. &lt;br /&gt;&lt;br /&gt;The idea of this compactor is to chain (or thread) the references to an object; then, when it is the object's turn to be processed, the compactor walks the chain back through all the references to this object and updates them to the new location. In this way, only the references held in objects before this one can be updated, i.e., forward-direction references. The backward-direction references are not yet known, so the compactor chains them when they are met. This second chain is traced, and its references are updated, in the second heap pass. The object is actually moved, in order, in the second pass. 
Since the chain only exists in objects after the processed one, its move doesn't destroy any chains. &lt;br /&gt;&lt;br /&gt;The chain is formed by converting a parent-children tree (an object and the references to it) into a child-sibling tree (starting from the object to those chained children) on-the-fly while the collector scans the heap. This compactor has two passes:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Phase 0. marking live objects;&lt;br /&gt;&lt;li&gt;Phase 1. scanning the heap sequentially to update all the forward-direction references;&lt;br /&gt;&lt;li&gt;Phase 2. scanning the heap again to update all the backward-direction references, and moving objects.&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;b&gt;3. IBM's move compactor&lt;/b&gt;&lt;br /&gt;The third compactor was proposed by IBM [4]. (It is actually a parallel compactor; here I discuss only the design idea.) It can be viewed as an improvement on the LISP2 compactor, but has only two passes. The idea is interesting. Its insight is that the LISP2 compactor needs multiple phases because it keeps the forwarding pointer in the object header, which cannot be overwritten until it has been read (i.e., used). So it can only move the objects after all the references are repointed. If the forwarding pointer is saved in an auxiliary data structure, and a mapping can be established between the object and its forwarded location, the objects can be moved in any phase.&lt;br /&gt;&lt;br /&gt;To implement the idea, the heap is partitioned into sections. Each section has a corresponding entry in an offset table. The section is viewed as a macro object, and its forwarding pointer is stored in the offset table. In this way, the variable-sized object compaction problem is converted into a fixed-sized section compaction problem.&lt;br /&gt;&lt;br /&gt;The efficiency of the algorithm depends on the assumption that many sections have no live objects, hence are dead sections. 
The choice of section size is important. The scheme can be further optimized if we use the beginning of the first live object of a section as the beginning of the macro object, and use the end of the last live object in the section as the macro object's end. The phases of the algorithm are:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Phase 0. marking live objects;&lt;br /&gt;&lt;li&gt;Phase 1. slide-compacting sections, keeping the move distance in an offset table;&lt;br /&gt;&lt;li&gt;Phase 2. updating all the references by adding the distance value.&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;(I have an idea to further optimize this algorithm for certain workloads.)&lt;br /&gt;&lt;br /&gt;[1] Richard E. Jones. Garbage Collection: Algorithms for Automatic Dynamic Memory Management. Wiley, Chichester, July 1996.&lt;br /&gt;[2] H. B. M. Jonkers. A fast garbage compaction algorithm. Information Processing Letters, 9(1):25–30, July 1979.&lt;br /&gt;[3] F. Lockwood Morris. A time- and space-efficient garbage compaction algorithm. Communications of the ACM, 21(8):662–5, 1978.&lt;br /&gt;[4] Diab Abuaiadh, Yoav Ossia, Erez Petrank, and Uri Silbershtein. An efficient parallel heap compaction algorithm. 
In OOPSLA'04.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-1551337975897702061?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/1551337975897702061/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=1551337975897702061' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/1551337975897702061'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/1551337975897702061'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/04/sequential-compacting-garbage-collector.html' title='Sequential in-place compacting garbage collectors'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-8355877966944185333</id><published>2007-04-01T21:30:00.001-07:00</published><updated>2007-04-23T05:16:31.680-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='JVM'/><title type='text'>Incremental-update vs. Snapshot-at-the-beginning tracing</title><content type='html'>I compared the implementation details of the mostly concurrent and on-the-fly GC algorithms in the last blog entry. This article gives a theoretical summary of the two algorithms. 
It's basically an excerpt from the Memory Management Glossary maintained by &lt;a href="http://www.memorymanagement.org/"&gt;www.memorymanagement.org&lt;/a&gt; [1].&lt;br /&gt;&lt;br /&gt;In my understanding, the mostly concurrent GC uses incremental-update tracing. Below is the explanation of the term incremental-update from memorymanagement.org:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;incremental-update, incremental update&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-left: 40px; font-style: italic;"&gt;Incremental-update algorithms for tracing, incremental GC note changes made by the mutator to the graph of objects and update the collector state to make it correctly trace the new graph.&lt;br /&gt;&lt;br /&gt;In order for the collector to miss a reachable object, the following two conditions need to hold at some point during tracing:&lt;br /&gt;&lt;br /&gt;1. The mutator stores a reference to a white object into a black object.&lt;br /&gt;2. All paths from any gray objects to that white object are destroyed.&lt;br /&gt;&lt;br /&gt;Incremental-update algorithms ensure the first condition cannot occur, by painting either the black or the white object gray (see Barrier techniques for incremental tracing [2] for details).&lt;br /&gt;&lt;br /&gt;They are so called because they incrementally update the collector's view of the graph to track changes made by the mutator.&lt;br /&gt;&lt;br /&gt;Historical note: This distinction between incremental-update and snapshot-at-the-beginning was first introduced for write-barrier algorithms, but it applies to any type of tracing algorithm.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt; Opposites&lt;/span&gt;: snapshot-at-the-beginning.&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;Here is the explanation of the term snapshot-at-the-beginning:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;snapshot-at-the-beginning, snapshot at the beginning&lt;/span&gt;&lt;br /&gt;&lt;br 
/&gt;&lt;div style="margin-left: 40px; font-style: italic;"&gt;Snapshot-at-the-beginning algorithms for tracing, incremental GC note changes made by the mutator to the graph of objects and update the collector state to make it trace relevant edges that the mutator deletes.&lt;br /&gt;&lt;br /&gt;In order for the collector to miss a reachable object, the following two conditions need to hold at some point during tracing:&lt;br /&gt;&lt;br /&gt;1. The mutator stores a reference to a white object into a black object.&lt;br /&gt;2. All paths from any gray objects to that white object are destroyed.&lt;br /&gt;&lt;br /&gt;Snapshot-at-the-beginning algorithms ensure the second condition cannot occur, by causing the collector to process any reference that the mutator overwrites and that might be part of such a path.&lt;br /&gt;&lt;br /&gt;They are so called because they keep track of references that existed at the beginning of the collection cycle. Note that this does not mean all modifications need to be seen by the collector, only those needed to complete tracing without missing a reachable object (see Barrier techniques for incremental tracing [2] for details), nor does it mean that it won't trace some references created during the collection.&lt;br /&gt;&lt;br /&gt;Historical note: This distinction between incremental-update and snapshot-at-the-beginning was first introduced for write-barrier algorithms, but it applies to any type of tracing algorithm.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt; Opposites&lt;/span&gt;: incremental-update.&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;Thanks to the glossary maintainers. I think the explanations capture very well the key difference between the two tracing algorithms, and hence the difference between mostly-concurrent and on-the-fly GCs.&lt;br /&gt;&lt;br /&gt;[1] &lt;a href="http://www.memorymanagement.org/glossary/"&gt;http://www.memorymanagement.org/glossary/&lt;/a&gt;&lt;br /&gt;[2] Pekka P. Pirinen. 
Barrier techniques for incremental tracing. ACM ISMM'98, pp. 20-25.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-8355877966944185333?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/8355877966944185333'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/8355877966944185333'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/04/incremental-update-tracing-vs-snapshot.html' title='Incremental-update vs. Snapshot-at-the-beginning tracing'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-5881632507302749043</id><published>2007-03-29T04:02:00.000-07:00</published><updated>2007-04-23T05:16:31.680-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='JVM'/><title type='text'>Snapshot-based GC vs. mostly concurrent GC</title><content type='html'>As discussed previously, the two on-the-fly GCs (DLG and SlidingView) [1] actually share the same idea. Both are snapshot-at-the-beginning (SATB) GCs. They do not require stop-the-world (STW) for the root marking phase, but enumerate (suspend) the mutators one after another.&lt;br /&gt;&lt;br /&gt;I also discussed the mostly concurrent GC [2]. Both mostly-concurrent and snapshot-at-the-beginning GCs are mark-sweep GCs, with phases of marking, tracing, and sweeping. 
But their ideas are quite different, and the implications for implementations are quite different as well.&lt;br /&gt;&lt;br /&gt;It's easy to understand some key differences:&lt;br /&gt;&lt;br /&gt;1. SATB doesn't need STW to terminate a collection cycle. It terminates once there are no gray objects. Mostly-concurrent has to have an STW phase for final tracing to guarantee correctness. &lt;br /&gt;&lt;br /&gt;The reason is that SATB only tries to identify the garbage known at the time the snapshot is taken, so it only traces the object connectivity graph of the snapshot. The number of objects in that graph is finite and constant. Because a traced object will not be traced again, the number of objects left to trace decreases monotonically and finally reaches zero. &lt;br /&gt;&lt;br /&gt;But mostly-concurrent needs to track all the updated objects, marking them dirty if they are clean. As the mutators execute, the number of dirtied objects increases. Although the total number of live objects is finite, the problem is that a traced object is cleaned and might be dirtied again. So the number of dirty objects cannot decrease monotonically, and an STW phase is a must to stop the generation of dirty objects.&lt;br /&gt;&lt;br /&gt;The pause time of the final tracing STW is determined at run time. The GC can be carefully designed to keep it under control.[3][4] &lt;br /&gt;&lt;br /&gt;2. When I mentioned the number of live objects above, I didn't mean to include the objects newly allocated after the tracing phase starts. How to handle new objects is another key difference between SATB and mostly-concurrent. &lt;br /&gt;&lt;br /&gt;In SATB, new objects are created marked (black); there is no need to trace them, since they are not in the snapshot. It's easy to simply add them to the live objects list. And the write barrier doesn't need to catch field updates in them. This is a big saving, since new objects are mostly active, with frequent updates, and the number of new objects is usually big. 
The downside of this handling is that most new objects die soon. Blindly treating them as live may retain much floating garbage.&lt;br /&gt;&lt;br /&gt;In mostly-concurrent, new objects are created clean and then treated like any other objects. That means the write barrier needs to catch the field updates in the new objects, and the tracing needs to scan (and rescan) them. This increases tracing overhead. An optimization is not to dirty new objects at all, but to trace them in the final tracing phase. This avoids repetitive rescans over new objects. (Since an object needs dirtying only if it has been scanned, care should be taken not to scan the new objects, so that there is no need to rescan them. This optimization was reported to improve performance significantly.[4]) &lt;br /&gt;&lt;br /&gt;3. Write barrier difference. The write barrier in SATB is designed to catch any reference update in order to save the original reference value in the snapshot. The saved values are stored in a remembered set and are processed by the collector concurrently. SATB cares about only the oldest value.&lt;br /&gt;&lt;br /&gt;In mostly-concurrent, the write barrier only lets the collector know that the object is updated. It cares about the new values. More importantly, if the dirty object is updated again before it is rescanned, the write barrier has nothing to do, since the object is already and still dirty. The value written by any single update is not important. What counts is the reference value when the dirty object is rescanned. Mostly-concurrent cares about only the newest value. The intuition is that the final STW tracing phase will find all the live objects with the latest reference values, and thus may keep less floating garbage.&lt;br /&gt;&lt;br /&gt;Overall, the design philosophies of the two concurrent GCs are far from each other. 
The mostly-concurrent GC is less precise in tracing since it has no steady object connectivity graph as SATB GC does: its steady graph is formed only at the end, during the stop-the-world phase.&lt;br /&gt;&lt;br /&gt;[1] http://xiao-feng.blogspot.com/2007/03/comparison-between-two-on-fly-garbage.html&lt;br /&gt;[2] http://xiao-feng.blogspot.com/2007/03/about-mostly-concurrent-gc.html&lt;br /&gt;[3] Yoav Ossia, Ori Ben-Yitzhak, Irit Goft, Elliot K. Kolodner, Victor Leikehman, &lt;a href="http://www.haifa.il.ibm.com/projects/systems/rs/papers/ParIncrConcurrent_GC_PLDI02.pdf"&gt;A parallel, incremental and concurrent GC for servers&lt;/a&gt;, PLDI02&lt;br /&gt;[4] Katherine Barabash, Yoav Ossia, Erez Petrank, &lt;a href="http://www.haifa.il.ibm.com/projects/systems/rs/papers/MC_GC_Revisited_oopsla03.pdf"&gt;Mostly concurrent garbage collection revisited&lt;/a&gt;, OOPSLA03&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-5881632507302749043?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/5881632507302749043/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=5881632507302749043' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/5881632507302749043'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/5881632507302749043'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/03/comparison-between-snapshot-based-gc.html' title='Snapshot-based GC vs. 
mostly concurrent GC'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-3269452362944347484</id><published>2007-03-25T04:20:00.000-07:00</published><updated>2007-04-23T05:16:31.683-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='JVM'/><title type='text'>About mostly concurrent GC</title><content type='html'>"Mostly concurrent GC" is a specific term for the GC technique invented by Boehm et al. [1]. Interestingly, Boehm called this technique "Mostly Parallel GC", meaning the collector thread can run in parallel with the mutator during most of the collection cycle. &lt;br /&gt;&lt;br /&gt;(Now the GC community usually uses "concurrent" for the parallelism between mutator and collector, while using "parallel" for the parallelism between collector threads. In the "old days" literature, "incremental" was sometimes also used to refer to concurrent collection, because a concurrent collection behaves almost the same as an incremental one on a uniprocessor platform, where the executions of collector and mutator threads are interleaved on the same processor. Today the term "incremental" refers to the technique in which the mutator threads do some collection work during their execution, e.g., a mutator can do a certain amount of tracing work when it allocates an object.)&lt;br /&gt;&lt;br /&gt;Mostly concurrent GC is a mark-sweep GC. The basic idea is to mark the heap concurrently with the mutator execution, and use a write barrier to mark dirty any object whose reference field is updated. Then the heap will be rescanned from the dirtied objects as new roots. 
As normal tracing does, the rescanning does not scan the marked objects that are not dirtied. The reason to use a write barrier is that the only reference to some object Obj1 may be written into a marked object. This marked object must be rescanned to find Obj1. An object's dirty flag is cleared when the object is scanned. It might be dirtied again if its reference field is updated again. Hopefully the next rescanning process will scan fewer live objects, assuming there are not many dirtied objects. Newly allocated objects do not need to be created dirty, because their references must be written into roots or other objects. The latter case will be caught by the write barrier, and the former case will be dealt with in a "final tracing" phase.&lt;br /&gt;&lt;br /&gt;Since the heap keeps being dirtied by the mutator, this process will never terminate. The GC has to stop the world at an appropriate point to have a final tracing phase. The final tracing phase marks the heap while the world is stopped, so it's guaranteed to mark all reachable objects. This is why the algorithm is "mostly concurrent" rather than "fully concurrent". The key difference between the final tracing and the previous rescannings is that the final tracing starts from both the mutators' roots and the dirtied objects. &lt;br /&gt;&lt;br /&gt;After the heap is marked by the final tracing phase, the world can be resumed and the sweep phase can be carried out concurrently by the collector. Some implementations of mostly concurrent GC carefully tune the concurrent/incremental tracing rate so that the heap runs out right at the stop-the-world starting point, so as to achieve the best overall throughput. But such an implementation has to do the sweep phase during the stop-the-world period, because otherwise, without freed space, the resumed mutators cannot make progress. 
[2] (Lazy sweep may be applied so that the sweeping work is done incrementally by the mutator, thus reducing the pause time.)&lt;br /&gt;&lt;br /&gt;When there are multiple mutators, there might need to be another stop-the-world phase at the beginning of a GC cycle to mark all the roots. Strictly speaking, I think this stop-the-world is not mandatory, because the final tracing phase ensures the correctness of the algorithm. The initial roots are only a hint for the concurrent marking to find reachable objects: the reachable objects found from the initial roots might still be reachable in the final tracing phase. A snapshot of all mutators' roots can accurately identify the reachable objects (and garbage) at the time the snapshot is taken. But if many of the reachable objects become garbage during the concurrent marking phase, the snapshot doesn't necessarily mean good GC efficiency.&lt;br /&gt;&lt;br /&gt;Then why don't we just trace from the mutators' roots one set after another, repeatedly, before the final tracing phase? Why do we need to remember the dirty objects and rescan from them instead of doing another round of tracing from roots? I think this is possible (I might be wrong). But I guess it has the problem that the set of live objects it can trace might be smaller than the set found by rescanning from dirty objects, because there might be many live object trees starting from a dirtied object. If we only start from roots, we may always miss them in the concurrent tracing phase, thus leaving more work to the final tracing phase and a longer pause time. Moreover, an updated reference is for sure live at the updating time, so rescanning from it might find real live objects. We need experiments to validate or invalidate this opinion.&lt;br /&gt;&lt;br /&gt;[1] Hans-Juergen Boehm, Alan J. Demers, and Scott Shenker. Mostly parallel garbage collection. SIGPLAN PLDI, 26(6):157-164, 1991.&lt;br /&gt;[2] Yoav Ossia , Ori Ben-Yitzhak , Irit Goft , Elliot K. 
Kolodner , Victor Leikehman , Avi Owshanko, A parallel, incremental and concurrent GC for servers, PLDI'02, ACM SIGPLAN Notices, v.37 n.5, May 2002&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-3269452362944347484?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/3269452362944347484/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=3269452362944347484' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/3269452362944347484'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/3269452362944347484'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/03/about-mostly-concurrent-gc.html' title='About mostly concurrent GC'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-1567877101938382906</id><published>2007-03-14T20:18:00.000-07:00</published><updated>2007-04-23T05:16:31.682-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='JVM'/><title type='text'>Comparison of two on-the-fly garbage collectors</title><content type='html'>After the discussions on on-the-fly (OTF) GC design, this article will discuss the differences between Azatchi et al's sliding-view based implementation [1] and IBM's 
implementation for Java [2]. They will be referred to as SlidingView GC and IBM's GC respectively in this article.&lt;br /&gt;&lt;br /&gt;IBM's GC is a faithful implementation of the DLG algorithm [3] with some adaptation to Java specifics. Azatchi's GC is based on the idea of a sliding view. A sliding view is a relaxed snapshot of heap and roots, which is taken not by stop-the-world synchronization but by on-the-fly handshakes.&lt;br /&gt;&lt;br /&gt;Basically, as concurrent mark-sweep GCs, both algorithms have the phases of:&lt;br /&gt;&lt;pre&gt; (cycle start) -- marking -- | -- tracing -- | -- sweeping -- (cycle end)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;First of all, in SlidingView GC, each mutator has two collection phase variables:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Snooping: a mutator's Snooping flag is turned on in the first handshake, and off when the mutator's roots are enumerated. So it is on during the marking phase.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;TraceOn: a mutator's TraceOn flag is turned on in the second handshake, and off right before the sweeping phase. So it is on during the marking and tracing phases.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;In IBM GC, each mutator has one status variable to remember the handshake status: sync1, sync2 and async. 
There is a global phase variable indicating the collection phase: marking, tracing and sweeping.&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt; marking: the phase is set to marking at the beginning of a collection cycle, before any handshake.&lt;br /&gt;&lt;/li&gt;&lt;li&gt; tracing: the phase is set to tracing after the second handshake and before starting the third handshake. The roots of a mutator are enumerated after the second handshake and before acknowledging the third handshake.&lt;br /&gt;&lt;/li&gt;&lt;li&gt; sweeping: the phase is set to sweeping after the heap is traced and before it is swept.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;From the description above, we can see the correspondence:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt; SlidingView: span of Snooping&lt;br /&gt;   IBM GC: range of marking + part of tracing before the mutator enters async (i.e., mystatus != async)&lt;br /&gt;&lt;/li&gt;&lt;li&gt; SlidingView: span of TraceOn&lt;br /&gt;   IBM GC: range of marking + tracing (i.e., mystatus != async || mystatus == tracing)&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;This correspondence will be clearer when we examine the write barriers in the two GCs, which are the key difference between the two implementations.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;IBM_write_barrier(SourceObj, slot, NewRef){&lt;br /&gt;  OldRef = *slot;&lt;br /&gt;  if( mystatus != async || mystatus == tracing){&lt;br /&gt;     markGray(OldRef);&lt;br /&gt;  }&lt;br /&gt;  if( mystatus != async ){&lt;br /&gt;     markGray(NewRef);&lt;br /&gt;  }&lt;br /&gt;  *slot = NewRef;   &lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;SlidingView_write_barrier(SourceObj, slot, NewRef){&lt;br /&gt;  OldRef = *slot;&lt;br /&gt;  if( mystatus == TraceOn &amp;&amp;amp; SourceObj.color == white){&lt;br /&gt;     ObjCopy = log(SourceObj);&lt;br /&gt;     if( SourceObj.dirty == FALSE){&lt;br /&gt;        addBuffer(ObjCopy);&lt;br /&gt;        SourceObj.dirty = TRUE;&lt;br /&gt;     
}&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  *slot = NewRef;   &lt;br /&gt;&lt;br /&gt;  if( mystatus == Snooping)&lt;br /&gt;     markGray(NewRef);&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The code above clearly shows that they are almost the same at a high level except for some differences in details:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Order of the operations. In IBM GC, the field update is done after the marking operations, while SlidingView GC updates the field between logging the object and marking the new ref value. I think this difference is trivial, since I believe the order of the last two operations can be arbitrary (this needs proof).&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Processing of the old ref value. IBM GC marks the old ref, while SlidingView GC logs the object's original copy. This is the real difference between the Sliding-View and DLG on-the-fly GCs, and is discussed further below.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Period of old ref marking. IBM GC marks the old ref till the end of the tracing phase, while SlidingView GC logs the object till the object is traced (scanned). This difference is related to the previous bullet, reflecting the real difference in nature between the Sliding View and DLG GCs.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;The basic theory behind Sliding View is snapshot-based concurrent GC. The idea is to take a snapshot of the rootsets and the heap, then trace from the rootsets for all live objects in the snapshot. The remaining objects in the snapshot are then garbage. Since the tracing phase is concurrent with mutator execution, any change to the heap snapshot is caught by the write barrier so that the collector traces only the original heap snapshot. That is, the original value of the object in the heap snapshot is logged when the object is updated. The original value needs logging only once, and only when the object is not scanned yet (color is white). 
If the object is scanned before any change, the scanned value is already the original value in the heap snapshot. In this way, the tracing phase is guaranteed to terminate, since the live objects in the snapshot will gradually all be scanned, after which the write barrier will not produce any new task for tracing. The tracing speed will be approximately equal to that of a common non-concurrent tracing. The difference is that concurrent tracing traces some floating garbage, since the old ref value might really be overwritten without being written into another visited place (scanned objects or roots).&lt;br /&gt;&lt;br /&gt;Logging the original value of the whole object has a downside: not all the reference values in the object will be updated during the tracing phase, so it might incur some redundant copying work. But the advantage is that it copies the object's old value only once; further updates will not incur any work.&lt;br /&gt;&lt;br /&gt;In comparison, DLG GC tracks all and only the old ref values that are overwritten. The non-updated ref values are not recorded. But it records all the updates, even to the same slot, as long as the old ref points to a yet-to-be-scanned object. The DLG GC tracing process is sure to terminate just as Sliding View's is, for the same reason that all the live objects will eventually be scanned.&lt;br /&gt;&lt;br /&gt;Now we see the real differences. Design-wise:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;SlidingView GC logs the updated object without checking the updated value. Termination is ensured by the updated object being scanned.&lt;br /&gt;&lt;li&gt;DLG GC logs the updated slot value. Termination is ensured by the updated slot value being scanned.&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;Performance-wise:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;SlidingView GC may have redundant copying work. 
New objects don't need logging since they are created black, i.e., the write barrier for new objects is fast.&lt;br /&gt;&lt;li&gt;DLG GC may log multiple updates to the same slot; the updates from the second one on do not really need logging, and they may keep more floating garbage. The scanning-check for the updated slot value may cause cache misses. The updates to new objects need a scanning-check for the old slot value. This might be a big disadvantage. &lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;So I think the two GCs are very similar. From the discussion above, it might be concluded that DLG GC is also a sliding-view based GC. This would be clearer if it logged only the first update to an object and stopped logging once the updated object is traced.&lt;br /&gt;&lt;br /&gt;[1] Hezi Azatchi, Yossi Levanoni, Harel Paz, and Erez Petrank. An on-the-fly Mark and Sweep Garbage Collector Based on Sliding Views. Proceedings of the ACM Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA'03), October 2003.&lt;br /&gt;[2] Tamar Domani, Elliot K. Kolodner, Ethan Lewis, Elliot E. Salant, Katherine Barabash, Itai Lahan, Yossi Levanoni, Erez Petrank, and Igor Yanover. Implementing an On-the-fly Garbage Collector for Java. 
The 2000 International Symposium on Memory Management (ISMM00), October, 2000.&lt;br /&gt;[3] See my discussion on &lt;a href="http://xiao-feng.blogspot.com/2007/03/on-fly-mark-sweep-garbabe-collector.html"&gt;On-the-fly mark-sweep garbabe collector&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-1567877101938382906?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/1567877101938382906/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=1567877101938382906' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/1567877101938382906'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/1567877101938382906'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/03/comparison-between-two-on-fly-garbage.html' title='Comparison of two on-the-fly garbage collectors'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-3607772642598609248</id><published>2007-03-13T06:09:00.000-07:00</published><updated>2007-04-23T05:16:31.684-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='JVM'/><title type='text'>On-the-fly mark-sweep garbabe collector</title><content type='html'>To have a practical on-the-fly 
mark-sweep GC, we need to resolve two more issues:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;In the mostly-concurrent mark-sweep GC, there is only one point where all the mutators are suspended simultaneously, i.e., the phase of roots marking. On-the-fly GC needs to remove this sync point.&lt;br /&gt;&lt;br /&gt;&lt;li&gt;To make the algorithm practical, GC needs to support real root sets, i.e., those in runtime stacks and local storage (registers). The write barrier can't catch the updates in them, so extra measures should be taken to deal with them. If the GC is not on-the-fly, this is not an issue since all the roots can be synchronously marked at the stop-the-world suspension point. Note that global variables are treated as a whole as an object. The reason is that global variables are a shared resource that mutators can access concurrently. See more discussion below.&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;To remove the stop-the-world sync point of root enumeration, we still need some kind of synchronization between collector and mutators to coordinate their steps.&lt;br /&gt;&lt;br /&gt;First we need a protocol to signal the start/finish of a GC cycle. Originally this is achieved naturally with the stop-the-world sync point. Now if we don't want stop-the-world, we can use a global flag to indicate the phases. That is, when the collector is going to start a GC cycle, it sets the flag; when the mutators see the flag, they respond, and once all of them have responded the collector can proceed to the next phase. 
This is called "handshake" by Doligez and Gonthier in their DLG algorithm [1, 2].&lt;br /&gt;&lt;br /&gt;A handshake goes this way:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The collector sets the flag and waits for all mutators to respond;&lt;br /&gt;{&lt;br /&gt;&lt;div style="margin-left: 40px;"&gt;   flagC = 1;&lt;br /&gt;for each mutator M&lt;br /&gt;wait for flagM == 1;&lt;br /&gt;&lt;/div&gt;}&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;A mutator sees the flag, and responds to the collector.&lt;br /&gt;{&lt;br /&gt;&lt;div style="margin-left: 40px;"&gt;    if( flagC == 1)&lt;br /&gt;flagM = 1;&lt;br /&gt;&lt;/div&gt;}&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;For the handshake, the collector can't set the local flags for the mutators. The reason is simple: to avoid a race condition while the mutator is mutating the heap, e.g., copying a large array. The response must come from the mutator itself, to guarantee that it is ready for the GC cycle. The real implementation of the handshake can be flexible as long as the race condition is avoided. An easy alternative is for the collector to suspend the mutator and set its local flag, assuming the mutator can never be suspended in a critical region.&lt;br /&gt;&lt;br /&gt;Once a handshake is done, the collector can start to mark the roots of the mutators, then trace and sweep the heap. The intention of the handshake is to inform the mutators to turn on write barriers. Only when all the mutators' write barriers are turned on can the root marking be started; otherwise some live objects might be lost, e.g., an object reference added to a scanned rootset and removed from a yet-to-scan rootset. The write barrier should be on until the trace phase is finished.&lt;br /&gt;&lt;br /&gt;The trace phase is finished when no gray objects can be found in the heap. This is guaranteed to terminate, since the number of live objects in the heap cannot increase except for the newly created objects. 
But new objects are created black, so all the gray objects will eventually be scanned.&lt;br /&gt;&lt;br /&gt;When the trace phase is done, we know we have all the live objects marked (including floating garbage). We can turn off the write barrier, since now we care about only the dead objects, those to be swept. During the sweep phase, any changes in the heap can't resurrect a dead object, and can't lose a live object (all live objects are marked). The only thing remaining is to always create new objects marked.&lt;br /&gt;&lt;br /&gt;In the description above, we assume the roots can be monitored by the write barrier like normal object fields. In other words, we treat the mutators' runtime stacks, local storage, and global reference variables as objects. But this is untrue, since it's too expensive to catch the updates in those areas, if it is possible at all. So we need special treatment for the roots marking process, to guarantee that no live object is lost.&lt;br /&gt;&lt;br /&gt;A live object could be lost during roots marking in the following cases:&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;A root ref is added to a mutator's stack that is already enumerated, while the same root ref is removed from another mutator's stack that is yet to be enumerated. If the stack were an object, the removed reference could be caught by the write barrier for field updates. Fortunately, copying a reference directly from one stack to another is impossible in Java.&lt;br /&gt;&lt;li&gt;The ref of a newly allocated object is written into an enumerated stack. This object has to be created marked.&lt;br /&gt;&lt;li&gt;A root ref is moved from one stack to another as in the first case above, but through a global variable or heap object. That is, one mutator writes it to a field, and the other one reads it into its stack. Then the first mutator removes the ref from its stack, and the field is overwritten by another value. Normally this root ref can be caught by the write barrier for the field update when it is overwritten. 
But there is a special case where it will be lost.&lt;br /&gt;&lt;/ul&gt;Now let's have a look at the special case. So far, we have actually made the imprecise assumption that the write barrier is an atomic operation. But in reality it is not. When multiple mutators execute the write barrier at the same time, some intermediate ref write can be lost. See the simplified write barrier code:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;write_barrier( SourceObject, Slot, NewRef){&lt;br /&gt;  OldRef = *Slot;&lt;br /&gt;  mark(OldRef);&lt;br /&gt;  *Slot = NewRef;&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;When two mutators execute the barrier simultaneously with interleaved instructions, the first mutator's slot write may happen between the second mutator's slot read and write; then the first mutator's NewRef1 cannot be caught by any write barrier, as shown below.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;  OldRef = *Slot;&lt;br /&gt;                            OldRef = *Slot;&lt;br /&gt;  mark(OldRef);&lt;br /&gt;                            mark(OldRef);&lt;br /&gt;  *Slot = NewRef1;&lt;br /&gt;                            *Slot = NewRef2;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;If a third mutator happens to read the slot between the two writes, it may read the value of NewRef1 and put it in its runtime stack that has been enumerated. Thus a live object is lost.&lt;br /&gt;&lt;br /&gt;To solve the problem, an easy approach is taken. The write barrier also marks the new ref (NewRef1 in the example), as below.&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;write_barrier( SourceObject, Slot, NewRef){&lt;br /&gt;  OldRef = *Slot;&lt;br /&gt;  mark(OldRef);&lt;br /&gt;  *Slot = NewRef;&lt;br /&gt;  mark(NewRef);&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This increases the write barrier overhead a lot, but the extra mark(NewRef) operation is only needed during the roots marking phase, and hopefully the roots marking phase is relatively short (compared to the trace phase when the write barrier is on).&lt;br /&gt;&lt;br /&gt;[1] D. Doligez and X. Leroy. 
A concurrent generational garbage collector for a multi-threaded implementation of ML. In Conference Record of the Twentieth Annual ACM Symposium on Principles of Programming Languages, ACM SIGPLAN Notices. ACM Press, January 1993.&lt;br /&gt;[2] D. Doligez and G. Gonthier. Portable, unobtrusive garbage collection for multiprocessor systems. In Conference Record of the Twenty-first Annual ACM Symposium on Principles of Programming Languages, ACM SIGPLAN Notices. ACM Press, 1994.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-3607772642598609248?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/3607772642598609248/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=3607772642598609248' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/3607772642598609248'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/3607772642598609248'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/03/on-fly-mark-sweep-garbabe-collector.html' title='On-the-fly mark-sweep garbage collector'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-4997379153096990014</id><published>2007-03-13T03:37:00.000-07:00</published><updated>2007-04-23T05:16:31.682-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' 
term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='JVM'/><title type='text'>Concurrent mark-sweep garbage collector</title><content type='html'>Recently I read two papers on concurrent GC design [1, 2]. The algorithms are called "on-the-fly" GC because the mutators are not required to be suspended simultaneously for any synchronization operation.  Sometimes, the mutators are suspended one by one. Normally "on-the-fly" GC is better than "mostly concurrent" GC, since it is usually time-consuming to stop all the mutators, especially with many mutator threads, which is not uncommon in today's Java applications.&lt;br /&gt;&lt;br /&gt;Both of the papers study on-the-fly mark-sweep GC. It's a little bit easier to design a concurrent GC with a mark-sweep collection algorithm because it doesn't move objects. As we know, the complexity of concurrent GC lies in the interactions between mutator(s) and collector(s).&lt;br /&gt;&lt;ol&gt;&lt;li&gt;During the collection process, a mutator may &lt;span style="font-style: italic;"&gt;mutate &lt;/span&gt;the heap by writing objects.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;While the application is running, a collector may move the objects.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;If the collector never moves objects, the interactions are much simpler, since we only need to consider the situation where the mutator(s) concurrently update reference fields of the objects. For this situation, guaranteeing the algorithm's correctness is simple: a write barrier suffices to catch the updates and act accordingly. Correctness here means two properties:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Safety. Never lose any live object;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Liveness. Dead objects will ultimately be reclaimed.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;As a basic theory for concurrent GC, we can find all the garbage if we can get the rootsets of the mutators at a single time point. 
Or in other words, at any time point, if we know the mutators' rootsets, the live objects are exactly those traceable from the rootsets. If we let the mutators continue execution, they will only change the set of objects in two ways: 1. create new objects; 2. make live objects dead. They will never make a dead object live. So if we can get the rootsets at a time point (a rootsets snapshot), we can have a concurrent GC by keeping those objects traceable from the rootsets and those newly allocated. This will keep some floating garbage, but that's OK if we can prove the liveness property of the algorithm.&lt;br /&gt;&lt;br /&gt;Assuming we don't use an on-the-fly GC, i.e., the GC can suspend all the mutators and mark their roots by stop-the-world, the concurrent algorithm is fairly straightforward. With all the roots marked, the mutators can be resumed with the write barrier turned on, and the collector can scan the heap to identify all the live objects as below.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Mark (gray) and scan (black) from the roots transitively;&lt;/li&gt;&lt;li&gt;Any update to a reference field is caught and the old value (the originally referenced object) is marked;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Newly allocated objects are created marked and scanned.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;When there is no gray (marked but not yet scanned) object left in the heap, the collector can start to sweep the heap and clear the live objects' mark status.&lt;br /&gt;&lt;br /&gt;The situation becomes complicated when on-the-fly is introduced.&lt;br /&gt;&lt;br /&gt;[1] Hezi Azatchi, Yossi Levanoni, Harel Paz, and Erez Petrank. &lt;a href="http://www.cs.technion.ac.il/%7Eerez/Papers/ms-sliding-views.ps"&gt;An on-the-fly Mark and Sweep Garbage Collector Based on Sliding Views&lt;/a&gt;. &lt;i&gt;Proceedings of the ACM Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA'03)&lt;/i&gt;, October 2003.&lt;br /&gt;[2] Tamar Domani, Elliot K. 
Kolodner, Ethan Lewis, Elliot E. Salant, Katherine Barabash, Itai Lahan, Yossi Levanoni, Erez Petrank, and Igor Yanover. &lt;a href="http://www.cs.technion.ac.il/%7Eerez/Papers/cgc9.pdf"&gt;Implementing an On-the-fly Garbage Collector for Java&lt;/a&gt;. &lt;i&gt;The 2000 International Symposium on Memory Management (ISMM00)&lt;/i&gt;, October 2000.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-4997379153096990014?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://xiao-feng.blogspot.com/feeds/4997379153096990014/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=4997379153096990014' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/4997379153096990014'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/4997379153096990014'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/03/concurrent-mark-sweep-garbabe-collector.html' title='Concurrent mark-sweep garbage collector'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8812827877261189081.post-2650558180078073890</id><published>2007-03-01T21:49:00.000-08:00</published><updated>2007-04-23T05:17:08.171-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Garbage collection'/><category scheme='http://www.blogger.com/atom/ns#' term='JVM'/><category 
scheme='http://www.blogger.com/atom/ns#' term='Apache Harmony'/><title type='text'>Concurrent GC for Apache Harmony?</title><content type='html'>Below is what I posted to the Harmony dev mailing list about a low-pause GC proposal:&lt;br /&gt;&lt;br /&gt;Harmony now has a reasonably advanced and stable parallel/generational GC built for 32-bit platforms (the GCv5). The remaining work for GCv5, I think, is mainly the 64-bit port and making use of the large heap sizes enabled by 64-bit, while performance tuning is always a continuous effort.&lt;br /&gt;&lt;br /&gt;Besides the ongoing work on GCv5, I would like to start thinking about a low-pause garbage collector for Harmony now, since some Harmony users might expect the interruption of their applications' execution for garbage collection to be as short as possible. For them, the "throughput" of GC is not all they want. GC's "pause time" or "latency" or "response time" is critical as well.&lt;br /&gt;&lt;br /&gt;Low-pause GC usually means "concurrent GC", in contrast to "stop-the-world GC". In concurrent GC, the mutators (application threads) can keep running while collectors (GC threads) are doing garbage collection. GCv5 so far is a "stop-the-world" GC, where all the mutators are suspended when a collection is started.&lt;br /&gt;&lt;br /&gt;The concept "parallel" is orthogonal to "concurrent". "Parallel" GC means that a collection can be conducted by multiple collector threads simultaneously. "Generational" is orthogonal as well.&lt;br /&gt;&lt;br /&gt;There is a claimed "pauseless GC" by Azul Systems [1], which depends on Azul's specific hardware support for read/write barriers. Without HW support, read barriers can be expensive [2]; but I think a very-short-pause-time GC is acceptable for Harmony, at least good enough in the near future.&lt;br /&gt;&lt;br /&gt;Some researchers separate "on-the-fly" GC from concurrent GC as a special case [3]. 
The difference, as stated, is that "on-the-fly" GC doesn't require any synchronization point where all mutators are suspended, i.e., it suspends and resumes mutators one after another, not at the same time. There are also "real-time" GCs proposed that can satisfy required real-time bounds. Metronome is one example [4].&lt;br /&gt;&lt;br /&gt;Considering the support in available platforms and Harmony's objectives, an on-the-fly GC might be our choice. But before that, we can have a traditional concurrent GC implemented, and adapt it into on-the-fly.&lt;br /&gt;&lt;br /&gt;Does anybody have good advice? Thanks.&lt;br /&gt;&lt;br /&gt;[1] &lt;a href="http://www.usenix.org/events/vee05/full_papers/p46-click.pdf" target="_blank"&gt;http://www.usenix.org/events/vee05/full_papers/p46-click.pdf&lt;/a&gt;&lt;br /&gt;[2] &lt;a href="http://cs.anu.edu.au/%7ESteve.Blackburn/pubs/papers/wb-ismm-2004.pdf" target="_blank"&gt;http://cs.anu.edu.au/~Steve.Blackburn/pubs/papers/wb-ismm-2004.pdf&lt;/a&gt;&lt;br /&gt;[3] &lt;a href="http://www.cs.technion.ac.il/%7Eerez/Papers/ms-sliding-views.ps" target="_blank"&gt;http://www.cs.technion.ac.il/~erez/Papers/ms-sliding-views.ps&lt;/a&gt;&lt;br /&gt;[4] &lt;a href="http://www.research.ibm.com/people/d/dfb/papers/Bacon03Realtime.pdf" target="_blank"&gt;http://www.research.ibm.com/people/d/dfb/papers/Bacon03Realtime.pdf&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8812827877261189081-2650558180078073890?l=xiao-feng.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://xiao-feng.blogspot.com/feeds/2650558180078073890/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=8812827877261189081&amp;postID=2650558180078073890' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/2650558180078073890'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8812827877261189081/posts/default/2650558180078073890'/><link rel='alternate' type='text/html' href='http://xiao-feng.blogspot.com/2007/03/concurrent-gc-for-apache-harmony.html' title='Concurrent GC for Apache Harmony?'/><author><name>Xiao-Feng Li</name><uri>http://www.blogger.com/profile/08325404561142470262</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry></feed>
