Tuesday, August 4, 2009

Transactional memory regarded not good programming model

After I posted my blog entry Can "transactional memory" be a new programming model? two years ago, I have been watching closely on the opinions from academia on TM as a programming model. Recently I read two papers on the same topic, one from Bohem [2] arguing that TM should only be an implementation technique, the other from C. J. Rossbach, et al [3] demonstrated with experiments that programming with TM is not necessarily easier than with coarse-grained locks.

[1] Can "transactional memory " be a new programming model? http://xiao-feng.blogspot.com/2007/04/can-transactional-memory-be-new.html
[2] Hans-J. Boehm, Transactional Memory Should Be an Implementation Technique, Not a Programming Interface, HotPar, Berkeley, CA. March 30, 2009
[3] Christopher J. Rossbach, Owen S. Hofmann, and Emmett Witchel, Is Transactional Programming Actually Easier? 8th Annual Workshop on Duplicating, Deconstructing, and Debunking (WDDD), Austin, Texas June 2009

Tuesday, April 14, 2009

A quick guide on Tick - the Harmony concurrent GC

I've written a slide deck on Tick [1], the Harmony concurrent GC we developed. Tick has been there in Harmony for about one year, now I got some time to put down its design and implementation. I do not expect people can immediately understand all the internals of Tick after reading the guide, but it should help those who want to dive into Tick or those who want to write their own concurrent GC. The immediate target of this document to help the GSoC2009 project with Tick.

Concurrent GC is a super-set of stop-the-world GC, in my opinion. It meets all the challenges in a STW GC and many beyond. The key challenges in my mind (based on my experience with Tick) are:

1. The interaction between mutators and collectors. Here in my slides I refer as the phase transition control. The idea was not so clear at the beginning of Tick development, but we then realized it is simply boiled down into a central state-machine control by mutator.

2. The termination control. It is easy to understand that to terminate the marking process, we need guarantee the global root set, collector local mark stacks, mutator local remember sets, and global remember set all be empty. But there are two subtleties in real implementations. a) the checks must be in order; b) concurrent access to mutator local remset by mutator and collector.

3. The collection triggering scheduler. It can not collect too early so as to waste the system resource when there are lots of free memory; also it can not be too late to become virtually STW collection, hence losing all the Tick design target. A proper triggering scheduler should consider both space and timing issues.

4. Keep collection overhead small. Concurrent GC wants to achieve short pause time, the expense is to lower the overall system throughput. It is a challenge to balance the design between pause time and system throughput. For example, to improve the performance, multiple collectors can be deployed in parallel.

I would like to spend more time to document the internals of Tick. Stay tuned.

[1] http://people.apache.org/~xli/presentations/harmony_tick_concurrent_gc.pdf

Wednesday, April 1, 2009

Two Harmony project proposals for GSoC 2009

Check link below [1] for application process. For Harmony projects, it is required to discuss the proposal in the mailing list dev@harmony.apache.org .

1. Modularize Harmony JIT by separating JET as a standalone JIT compiler

So far the JIT component (called Jitrino) of Harmony has virtually two JIT implementations: JET and OPT. Jitrino.JET is a fast but non-optimizing JIT, and Jitrino.OPT is an optimizing JIT. The code base of JET and OPT shares lots of code hence they are mixed in one module. This is undesirable for situations where people need only JET, for fast compilation, for small footprint. This project proposes to create a standalone JET-based JIT module for Harmony. It does not require to remove JET from Jitrino, but to create a new JIT module with JET. This project is also a very good exercise to examine the JIT modularity design, the interface between JIT and other components, the interaction between multiple co-existing JIT modules.

2. Implement WeakReference support in Harmony concurrent GC

Harmony already has a concurrent GC (called Tick, with three concurrent GC algorithms). It runs well with standard benchmarks. The only remaining unfinished feature is WeakReference support . Weakly referenced object (i.e., referent) is accessed through get() interface. That means, get() operation can make a weakly reachable referent strongly reachable. During concurrent collection, the system must monitor the get() operation to catch this change of reachability, otherwise the referent could be reclaimed. This project also includes to integrate the WeakReference processing with Finalization process. Other optimizations in Tick are also desirable, such as to reduce the amount of floating garbage.

[1] http://wiki.apache.org/general/SummerOfCode2009

Saturday, August 23, 2008

Thread mapping: 1:1 vs M:N

This is an old essay I wrote about 7 years ago, when I was investigating what kind of thread mapping is better for a runtime system (especially the ORP [1] JVM I was working on). It is worth to put it here as one of the series of blog articles. I edited it a little bit to incorporate some new information.

I discussed the concept of thread mapping, 1:1 or M:N, in blog entry: What is a thread. [2] M:N thread binding is "conceivably" better than 1:1 because of cheap context switch. It is believed to be cheap because it does not need to involve OS kernel operations for user level thread switch. So an M:N threading normally has following setting: there are processor number pf kernel threads; unlimited number of user threads multiplexes over the kernel threads. Since kernel context switches are expensive, it's expected there are no kernel thread context switches at all except OS kernel process scheduling. All the context switches are expected to happen only in the user-level scheduler. As a comparison, in 1:1 threading all the context switches happen in kernel.

1. Context Switch
But this conception is not necessarily true. The kernel thread context switch has no more operations than user thread context switch, except that, kernel switch requires to trap into the kernel. So the only additional overhead is the system call. But if we study the real nature of context switch, we can see the truth is not so obvious.

Scenario 1: Blocking operations
Context switches mostly happen when a thread is doing some I/O operations and blocks in kernel. Since it blocks in kernel, it's natural to schedule on-site by the kernel to context switch to another kernel thread; this is the approach of 1:1 binding. The trapping overhead is there already.

In M:N mapping, there are two ways to continue another user thread in the same kernel thread context:

1. Scheduler activation implemented in kernel. When a blocking happens, kernel can vector the event to user scheduler via an upcall. Then the user scheduler can determine which user thread to schedule next. But this is cumbersome because it requires one upcall which is an extra overhead compared to the kernel scheduling; and the user scheduler is not necessarily as efficient as kernel scheduler due to the immaturity compared to the OS kernel shceudler;

2. Non-blocking system call. If OS doesn't provide scheduler activation, the only way to continue a user thread when another blocks is to use non-blocking system call. That is, whenever a thread is calling a possibly blocking syscall, it first polls it to check if it is going to block. If it is, the user scheduler will schedule another thread, and polls again in next scheduling cycle. This solution seems ok, but the polling itself is a system call, which involves everything in a blocking system call except the context switch. So user-level scheduling consists of a non-blocking system call + user thread context switch. The cost is equal to a blocking system call + kernel context switch.

Although it is no better, this is the ideal case of user-level scheduling. In common cases, the non-blocking simulation of blocking syscall involves much more than stated here. It usually requires a syscall to save the current blocking status of the file descriptor and another syscall to restore it, before and after the polling respectively. And the next scheduling cycle may repeat these operations again to check if the blocking condition is resolved. As a comparison, kernel scheduling is much simpler and cleaner, where the blocking thread is simply put into sleep, and then waken up till the blocking condition is known resolved.

The reason for this complexity of non-blocking syscall simulation is easy to understand: user scheduler knows nothing about kernel status. It has to use syscall to figure out the blocking status. Better asynchronous I/O support can help to solve the problem, where the user scheduler is notified by the kernel for the async processing, and a completion notification is sent by the kernel upon the processing is finished. This kind of interaction between user threads and kernel is implemented in Windows NT4.0 and Solaris10 and their later as I/O Completion Port.

Scenario 2: Synchronization
Another scenario of thread blocking is caused by the synchronizations among threads. This is the case where all the scheduling can be done at user level, since the synchronization implementation could all be in the user runtime. If the synchronization usage is extensive in the application and the contention is intensive, user level scheduling can bring some performance advantage. But again, it is not that simple.

1. Normally, in the application domain, a highly contending application is unusual and considered sub-optimal in its design. For example, the highly contended SPECpbob was substituted by rarely contended SPECjbb2000. (Note, lots of synchronization operations do not mean lots of contention. They are completely two different issues.) Actually, some source said the thread blockings caused by synchronization take only very minor ratio in all the blockings (< 5%).

2. Even it is good to use user-level scheduling for thread synchronization blocking, a threading library has to deal with process level mutex, which makes the thread synchronization implementation complicated and less efficient as expected. (Futexes helped here.)

This is the scenario supporting the idea of M:N mapping; well the benefits are not convincing enough.

Scenario 3: Uncaught exceptions
There are lots of situations where thread blockings are hard to be simulated with non-blocking operations, e.g. page fault. Page fault happens in kernel, and is hard to predict. Neither scheduler activation nor a non-blocking wrapper can work around it easily except just be blocked in kernel. This limits the concurrency achievable.

As stated earlier, the problem is due to the limited information known at user level. Lots of things like processor affinity, intelligent scheduling for hyperthreads or NUMA optimizations, scheduling based on system load outside of the process, etc. are almost impossible. Scheduler activation can help but not a complete solution.

Moreover, a problem is, all of the scheduling supports are always existing in kernel scheduler. To achieve the scheduling capability at user-level, those supports have to be duplicated at user-level.

2. Cooperative user threads
Even if the user-level scheduling is effective, is it really a good thing? People may think that user scheduler is better because the user threads are cooperative, and they guess that "cooperative" sounds like a better behavior pattern than preemptive.

Number of scheduling units
It is easy to believe that best performance is normally achieved when there are equal number of processors and scheduling units (kernel threads). User threads sharing a kernel thread are bound to a processor, and do not need preemptiveness. But this problem is, in nowadays operating system, there are always many running processes at the same time. So even if user threads are cooperative, the kernel thread containing them is already preempted from time to time.

Thread cooperation
It could really be a good thing if user threads are really cooperative without much extra cost. Unfortunately, since a non-preemptive scheduler depends on application behavior for scheduling, it normally requires the application be aware of the scheduling and yields voluntarily; Otherwise it is easy to be deadlocked or starved. But this awareness is burdensome to the developer and usually does not exist at all. The threading runtime can not always be able to insert yields properly.

Different from many people's expectation, the preemptive scheduling is more intuitive to application developer than the cooperative one. (The situation is different from the mechanism of stop-the-world for GC in runtime system. I will discuss it later).

More importantly, preemption might be necessary for any environment where real-time or soft real-time responsiveness is needed, both for rouge-thread cases and for cases where the processor resource has to be given up due to a higher priority thread needing it.

Resource sharing
An advantage of user threads in one kernel thread might be that, the threads can share everything of the system resource, esp. including the page table and TLB slots, which could lead to better performance due to less misses. Well, the situation is, kernel threads can share these resources too. (This feature needs architectural support. AFAIK, Linux does pretty good on IA32 processor.)

Thread creation/cancellation
One important claimed advantage of user thread is the creation and cancellation cost of user thread is low. This is mostly true if the thread runtime is implemented correctly. But its importance is largely reduced due to its user thread nature. Being not a separate system scheduling unit, it is unclear what is the intention to create tons of user threads that are only running within one kernel thread container. They will compete for the precious time slice of the single kernel thread.

In reality, most systems do not create or cancel lots of threads in short time. In cases when lots of threads are needed, the applications may want to reduce the creation overhead with thread pooling. Lots of threads creation doesn't necessarily mean lots of concurrency. They may just start and finish frequently. And even if there are lots of concurrency, it's hard to be leveraged because it has to be implemented by scheduling efficiency, which is not so good for user-level threading as we discussed above.

3. Obvious disadvantages
In spite of the suspicious "advantages" of M:N threading discussed above, user-level scheduler has some obvious disadvantages.

Duplication of the schedulers
M:N requires two schedulers which basically do same work, one at user level and one in kernel. This is undesirable. It requires frequent data communications between kernel and user space for scheduling information transference.

One subtler point is, the duplication takes more space in both Dcache and Icache for scheduling than a single scheduler. It is highly undesirable if cache misses are caused by the schedulers but the application, because a L2 cache miss could be more expensive than a kernel thread switch. Then the additional scheduler might become a trouble maker! In this case, to save kernel trappings does not justify a user-scheduler, which is more truen when the processors are providing faster and faster kernel trapping execution.

Thread local data maintenance
M:N has to maintain thread specific data, which are already provided by kernel for kernel thread, such as the TLS data, error number. To provide the same feature for user threads is not straightforward, because, for example, the error number is returned for system call failure and supported by kernel. User-level support degrades system performance and increases system complexity.

System info oblivious
Kernel scheduler is close to underlying platform and architecture. It can take advantage of their features. This is difficult for user thread library because it's a layer at user level. User threads are second-order entities in the system. If a kernel thread uses a GDT slot for TLS data, a user thread perhaps can only use an LDT slot for TLS data. With increasingly more supports available from the new processors for threading/scheduling (Hyperthreading, NUMA, many-core), the second order nature seriously limits the ability of M:N threading.

This is what I thought on the thread mapping issues. It is a long article, and thanks for reaching here. There are some contents in the original essay that I do not include here. Those are some considerations on the threading needs in a runtime system, such as whether we need suspend/resume API, better inter-thread signaling, etc. I might discuss them in future.

[1] Open Runtime Platform, http://orp.sourceforge.net
[2] What is a thread? http://xiao-feng.blogspot.com/2008/08/what-is-thread.html

Sunday, August 17, 2008

What is thread escape analysis?

In last blog entry, I discussed "thread local data" [1]. I mentioned that if some data are known to be thread local, some optimizations can be applied to the application manually or automatically.

One optimization is, we can put the thread local data into registers. Register is thread local anyway, so the semantics are kept. Register access is much faster than memory access, so this optimization can improve the performance. In compiler, this is done with register allocation, scalar replacement (i.e., use scalars to replace non-scalar data fields, such as array or class), etc.

Thread local data can be be put on runtime stack as well. Stack is also in memory as heap space, but stack has an advantage that data on it are freed automatically when the stack frame is cleared (when a method returns). In this case, the data actually are not only thread local, but also method local, which is more strict condition than thread local. (some people claim that stack-allocated data have better access locality, but my experience did not confirm that.)

Even if the thread local data can not be put into register or stack, there are still optimizations applicable. For example, garbage in thread local data can be recycled without stopping other threads for root enumeration. (Well, technique still needed to enumerate roots in global variables).

No matter where they are, all the thread local data have an important optimizing opportunity: They do not need any locking operations for mutual exclusive access. This is important because locking operation usually is very expensive.

Then the question is how the compiler can identify some data are thread local automatically. This is called escape analysis. The compiler analyzes the source code to find if the reference of an object is passed to other thread. This analysis has to be conservative to guarantee the correctness. For example, usually when a reference is written to a global variable, it is considered escaping, because the global variable could be read be other thread.

Escape analysis can also be conducted at runtime without compiler static analysis. That is, when an object is created, it is set thread local to its creating thread. Then the system monitors all the accesses to this object. If it detects any other thread tries to access the object, the system then marks the object to be escaping. This technique is called escape detection instead of escape analysis sometimes. Some of the thread local data optimizations can still be applied to the dynamically detected thread local data, but some static optimizations might not be suitable.

"Thread local" actually is not necessarily restricted to "accesses". Many operations can be the property of "thread local". For example, if an object is only locked by one thread, it is called thread local lock. Thread local lock can be eliminated, even the object itself is accessed by multiple threads. So thread local data is more restrictive than thread local object.


[1] What is thread local data? http://xiao-feng.blogspot.com/2008/08/what-is-thread-local-data.html

What is thread local data?

Thread local data refer to those data owned by one thread. Normally "owned" here means the data are only accessed by that thread. Sometimes, it is also called thread-private data, thread-specific data, thread-local storage, etc. Thread local data are interesting because the property of being owned by a single thread can be utilized to improve the application's performance or the program design.

There are basically three kinds of thread local data. They are:


  • Registers. The registers can only be accessed by single thread (or process, depending on the context) normally. (Yes, some processors have global registers, and even the common registers can be accessed with tricks, but those are out of the scope of my discussion here.)
  • Stack. This is known to be associated with a specific thread. Sometimes a thread is identified by its stack.
  • Thread-local heap. Within the shared heap space, a region can be owned by a single thread.


Registers and stack are system-supported thread local data, as we discussed in "What is a thread?" [1]. They cannot be accessed by other threads by design.

Thread-local heap is different. It is supported not by design, but by convention, because heap is sharable to all threads. A heap region is local to one thread means either of the following two situations:
1. The region is not accessible to other threads. The region can be protected by virtual memory mechanism or whatever technique to enforce the convention, or it is simply a rule complied by all the threads.
2. The region is accessible to all threads, but only one thread actually accesses it (up to the moment). This property of the data is called "non-escape", i.e., they are confined to a single thread's territory. Once the data are accessed by other thread, it becomes "escape". (We will discuss "escape" and "escape analysis" later.)

Different from the registers or the stack, there is no default system support for thread local heap. Programmers need some way to define, to find, and to use thread local heap. Since every thread can claim thread local regions, they should be able to find its own regions with same API like my_region(). The solution is, we can put the region pointer into the same register or the same stack slot of different threads. Since the registers and the stack are system-supported thread local data, even using the same register name or stack slot, different threads will access their own registers or stack slots. So they can get their own region pointers from the same API my_region(). This is how the current thread libraries implement "thread local storage" or "thread specific data".


[1] What is a Thread? http://xiao-feng.blogspot.com/2008/08/what-is-thread.html

What is a thread?

Some of my friends are confused by all kinds of concepts around thread, such as system thread, kernel thread, native thread, user-level thread, application thread, java thread, software thread, hardware thread, simultaneous multithread (SMT), hyperthread (HT), helper thread, etc, etc.

So what a thread is?

A thread is nothing but a control flow of execution. It is an concept only valid in control-flow machine, because only then is there control flow. Then what is control flow? Or in other words, how to represent a control flow. In my opinion, only two entities are essential to represent a control flow. They are the program counter and the stack.

Program counter points to next instruction to execute. Stack stores the temporary execution result. To be a meaningful stack, a stack pointer is needed pointing to the next location to store the execution result.

Program counter and the stack can uniquely identify a control flow of an execution. They cannot be shared with other threads (except for some extreme cases). All of other computing resources can be shared between threads, such as heap, code, processor, etc. Hence, program counter and the stack are called the thread context.

That means, if a system provides threading support, it should at least provide a way to distinct one thread context from another, be it in software, hardware or hybrid. If processor hardware provides thread context support, it is hardware thread. Different hardware threads can share same processor pipeline (SMT) or use different pipelines, depending on the design. HT is an implementation of SMT. Any control-flow processor must provide at least one thread context; otherwise, there would be no control flow.

If the processor has only one thread context, threading support still can be provided by software. That is, multiple software threads can multiplex over the same thread context. When a software thread is scheduled to run, its context is loaded from the memory into the hardware thread context. If it is scheduled off the processor, its context is saved into the memory.

Software thread design has an implication. Since the thread context loading/storing (or switch) is conducted by software, it requires the software thread design to guarantee that there are chances to conduct the switch operation (or thread scheduling). An easy way to implement is to leverage hardware interrupt. Once the software receives a hardware interrupt (timer or whatever), it executes the interrupt handler and within the handler, it schedules the threads.

Sometimes, the timer is too long to wait. For example, when a thread is sleeping, before a timer handler is executed, no other thread can be scheduled. This is not desirable. A straightforward solution is, if a thread wants to sleep, it always invokes the scheduler, then the scheduler can switch on another thread.

Now that multiple software threads can share the same hardware context, it is not hard to think that, a software thread context can also be multiplexed by another level of multiple software threads. This is true. So conceptually, software threads can be built with infinite levels, every higher level threads multiplex the contexts of its next level threads.

Also a natural corollary is, a thread is only a thread in your level of discussion. It could contain multiple threads in a higher-level of discussion. Well, although this is true, people do not really build many levels of software threading libraries. Usually there are only two levels, one level shares the hardware context, and the other level shares the software context.

This is reasonable. The most important reason is, all the software threads in one level are treated as a single thread in the next level, so they are scheduled as one thread in the next level. That means, they total only share the time slice of a single thread in the next level. If the next level thread is scheduled off the processor, none of them can be continuing. This is inconvenient.

More inconvenient is, sometimes, only one thread wants to sleep, but all the other threads have to sleep with it together, because they are treated as a single thread in the next level scheduler who sees the sleep operation. This issue can be partially solved with non-blocking sleep. That is, when a thread wants to sleep, it does not really sleep in the sense of the next level scheduler. It only sleeps in the eyes of its level's scheduler. This scheduler will schedule another thread at the same level. From the next-level thread scheduler's point of view, the thread is just continuing without sleep at all. In threading terminology, all the blocking operations (such as sleeping) in one level are implemented as non-blocking in its next level. (Well, this requires the system support for non-blocking operations, such as socket snooping, etc.)

Only operating system kernel really takes control of the execution engine (i.e., the processor or the pipeline). So the time slice concept is only really meaningful to the kernel. That means, only the threading at kernel level can really manipulate all the resources. Higher levels of software threads should always try to leverage the support of kernel threading. This is the fundamental reason why we want at most one additional level of threading above kernel threads.

Kernel threads are exposed to user applications through threading APIs. They are called native threads by the applications, such as NPTL or Linuxthreads in Linux, and WinThreads in Windows. The threading library implemented on top of native threads is called user-level threading. For the inconvenience we discussed above, not so many software today employ user-level threads.

User-level threads have its own advantages in certain scenarios. For example, multiple user threads never run in parallel on multiple processors/cores, because they are actually just single thread from OS' point of view.

Java thread is thread in another dimension. It is actually a language concept. It can implemented in any of threading mechanisms discussed above. Previously before Java, all the threading supports are kind of independent of programming languages. They are just system supports, and any languages can utilize if they want. Java takes a different approach that, it builds threading concept in its language. This is important for program semantic correctness. Hans had a PLDI paper with title "Threads Cannot be Implemented as a Library" [1]. And people are trying to introduce threading as a language construct into more languages.

[1] Hans Boehm, Threads Cannot be Implemented as a Library, www.hpl.hp.com/techreports/2004/HPL-2004-209.pdf