Best Practices for Cache Locality in Multicore Parallelism in F#

I'm studying multicore parallelism in F#. I have to admit that immutability really helps with writing correct parallel implementations. However, it's hard to achieve good speedup and good scalability when the number of cores grows. For example, my experience with the Quick Sort algorithm is that many attempts to implement parallel Quick Sort in a purely functional way, using List or Array as the representation, have failed. Profiling those implementations shows that the number of cache misses increases significantly compared to the sequential versions. However, if one implements parallel Quick Sort using in-place mutation of arrays, a good speedup can be obtained. Therefore, I think mutation might be a good practice for optimizing multicore parallelism.
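For reference, here is a minimal sketch of what "parallel Quick Sort using mutation inside arrays" might look like; the names (`parQuickSort`, `partition`) and the sequential cut-off of 2048 elements are my own assumptions, not taken from any particular benchmark:

```fsharp
open System.Threading.Tasks

// Swap two elements of an array in place.
let swap (a: 'T[]) i j =
    let t = a.[i]
    a.[i] <- a.[j]
    a.[j] <- t

// Lomuto partition: place a.[hi] into its final position, return that index.
let partition (a: 'T[]) lo hi =
    let pivot = a.[hi]
    let mutable i = lo - 1
    for j in lo .. hi - 1 do
        if a.[j] <= pivot then
            i <- i + 1
            swap a i j
    swap a (i + 1) hi
    i + 1

// Sort a.[lo..hi] in place, recursing into tasks only for large ranges
// so that task overhead doesn't swamp the parallel gain.
let rec parQuickSort (a: 'T[]) lo hi =
    if lo < hi then
        let p = partition a lo hi
        if hi - lo < 2048 then
            parQuickSort a lo (p - 1)
            parQuickSort a (p + 1) hi
        else
            let left = Task.Run(fun () -> parQuickSort a lo (p - 1))
            parQuickSort a (p + 1) hi
            left.Wait()
```

Each task works on a disjoint contiguous slice of the same array, which is exactly the kind of layout the answers below argue for.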

I believe that cache locality is a big obstacle for multicore parallelism in a functional language. Functional programming involves creating many short-lived objects; the destruction of those objects may destroy the locality of CPU caches. I have seen many suggestions for improving cache locality in imperative languages, for example, here and here. But it's not clear to me how they would be applied in functional programming, especially with recursive data structures such as trees, which appear quite often.

Are there any techniques to improve cache locality in an impure functional language (specifically F#)? Any advice or code examples are more than welcome.


As far as I can make out, the key to cache locality (multithreaded or otherwise) is:

  • Keep work units in a contiguous block of RAM that will fit into the cache

To this end:

  • Avoid objects where possible
    • Objects are allocated on the heap, and might be sprayed all over the place, depending on heap fragmentation, etc.
    • You have essentially zero control over the memory placement of objects, to the extent that the GC might move them at any time.
  • Use arrays. Arrays are interpreted by most compilers as a contiguous block of memory.
    • Other collection datatypes might distribute things all over the place: linked lists, for example, are composed of pointers.
  • Use arrays of primitive types. Object types are allocated on the heap, so an array of objects is just an array of pointers to objects that may be distributed all over the heap.
  • Use arrays of structs, if you can't use primitives. Structs have their fields arranged sequentially in memory, and are treated as primitives by the .NET compilers.
  • Work out the size of the cache on the machine you'll be executing on
    • CPUs have different sizes of L2 cache
    • It might be prudent to design your code to scale with different cache sizes
    • Or, more simply, write code that will fit inside the lowest common cache size your code will be running on
  • Work out what needs to sit close to each datum
    • In practice, you're not going to fit your whole working set into the L2 cache
    • Examine (or redesign) your algorithms so that the data structures you are using hold data that's needed "next" close to data that was previously needed.
    • In practice this means that you may end up using data structures that are not theoretically perfect examples of computer science, but that's all right: computers aren't theoretically perfect examples of computer science either.
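The points about arrays of objects versus arrays of structs can be sketched in F#; the type names here (`PointClass`, `PointStruct`) are illustrative only:

```fsharp
// A reference type: an array of these is an array of pointers
// into the heap, with no guarantee the targets sit near each other.
type PointClass(x: float, y: float) =
    member _.X = x
    member _.Y = y

// A value type: an array of these is one contiguous block of
// 16-byte (X, Y) records, so a linear scan touches few cache lines.
[<Struct>]
type PointStruct =
    val X: float
    val Y: float
    new (x, y) = { X = x; Y = y }

let boxedPoints = Array.init 1000 (fun i -> PointClass(float i, float i))
let flatPoints  = Array.init 1000 (fun i -> PointStruct(float i, float i))

// A cache-friendly sequential scan over the struct array.
let sumX (pts: PointStruct[]) =
    let mutable acc = 0.0
    for p in pts do
        acc <- acc + p.X
    acc
```

Both arrays hold the same logical data, but only the struct array keeps it contiguous in memory.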

    A good academic paper on the subject is Cache-Efficient String Sorting Using Copying


    Allowing mutability within functions in F# is a blessing, but it should only be used when optimizing code. The purely-functional style often yields a more intuitive implementation, and hence is preferred.

    Here's what a quick search returned: Parallel Quicksort in Haskell. When it comes to performance, be concrete: choose a processor, then benchmark it with a specific algorithm.

    To answer your question without specifics, I'd say that Clojure's approach to implementing STM could be a lesson in the general case on how to decouple paths of execution on multicore processors and improve cache locality. But it's only effective when the number of reads outweighs the number of writes.


    I am no parallelism expert, but here is my advice anyway.

  • I would expect that a locally mutable approach where each core is allocated an area of memory which is both read and written will always beat a pure approach.
  • Try to formulate your algorithm so that it works sequentially on a contiguous area of memory. This means that if you are working with graphs, it may be worth "flattening" nodes into arrays and replacing references with indices before processing. Regardless of cache locality issues, this is always a good optimisation technique in .NET, as it helps keep garbage collection out of the way.
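To illustrate the "flattening" idea, here is a sketch that converts a binary tree into parallel arrays, replacing child references with integer indices (−1 for a missing child). The names (`Tree`, `FlatTree`, `flatten`) are my own, not from the answer:

```fsharp
// A conventional pointer-based tree.
type Tree =
    | Leaf of int
    | Node of int * Tree * Tree

// Flattened form: values plus left/right child indices into the same arrays.
type FlatTree =
    { Values: int[]
      Left:   int[]
      Right:  int[] }

let flatten (root: Tree) : FlatTree =
    let values = ResizeArray<int>()
    let left   = ResizeArray<int>()
    let right  = ResizeArray<int>()
    // Returns the index at which the subtree's root was stored.
    let rec go t =
        match t with
        | Leaf v ->
            let i = values.Count
            values.Add(v); left.Add(-1); right.Add(-1)
            i
        | Node (v, l, r) ->
            let i = values.Count
            values.Add(v); left.Add(-1); right.Add(-1)  // reserve slot first
            left.[i]  <- go l
            right.[i] <- go r
            i
    go root |> ignore
    { Values = values.ToArray(); Left = left.ToArray(); Right = right.ToArray() }
```

After flattening, traversals become index arithmetic over contiguous arrays, which the GC never has to chase through.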