The beginning

In basic terms, C’s memory management gives us ‘the stack’ and ‘the heap’; the former is used automatically by variables with function scope, and the latter is provided by C’s malloc/free and C++’s new/delete.

Managing memory allocated from the heap turns out to be error prone, with the possibility of memory leaks, double deletions, dangling pointers, and so on. All of these problems stem from the fact that it’s up to the programmer to correctly match every new with exactly one delete, with nothing stopping said programmer from using unstructured/spaghetti code in their buggy attempts to achieve this goal.

On the other hand, memory allocated from the stack is extremely predictable: the point of construction and destruction of every object follows a strict structure that makes leaks/double-deletions impossible, and also lets us easily reason about our code to avoid dangling pointers. N.B. ‘avoid’ dangling pointers, not ‘eliminate’ them. Bad programming can still break the stack’s structured rules, such as when returning a reference to a local variable, which can then be erroneously accessed after the variable has been popped.

In C++, the hero that solves this problem is RAII, and much of its magic relies on making heap allocations follow the nice, predictable lifetime rules of stack allocations, which we can easily reason about. In essence, RAII takes unstructured heap lifetimes and associates them with particular scopes (which often reside on the stack).
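For example, here is a minimal sketch of the idea (Buffer is a hypothetical type, shown instead of the standard smart pointers for clarity):

// This (hypothetical) Buffer type owns a heap allocation, but its lifetime
// follows the strict, predictable rules of the stack.
struct Buffer
{
  explicit Buffer( int size ) : data( new char[size] ) {}
  ~Buffer() { delete[] data; } // runs automatically when the scope ends
  Buffer( const Buffer& ) = delete;            // forbid copies, so there is
  Buffer& operator=( const Buffer& ) = delete; // always exactly one owner
  char* data;
};

void Example()
{
  Buffer temp( 1024 ); // heap memory, but a stack-like, scoped lifetime
  // ... use temp.data ...
} // ~Buffer runs here: no leak, no double-delete, no manual delete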

The pitfalls of RAII

The uses of RAII range from the simple auto_ptr, through to smart pointers such as shared_ptr. In the OOP paradigm, it’s productive to encapsulate/hide details such as reference counting behind the scenes. The problem is that when trying to write in a more functional style for the sake of parallelism, these encapsulated, hidden, mutable states (such as reference counts) are a huge side effect. Unfortunately this makes reference-counted smart pointers like shared_ptr a bit of a shaky abstraction within some kinds of multi-threaded environments.

This wouldn’t be such a big deal, except that in a typical “thread-safe” implementation, this side effect manifests as an atomic counter shared between N threads. i.e. The internal reference counter is a synchronisation point between threads, and may become a bottleneck. In a worst-case scenario, you’ve got 100 cores running your app, but they’re all waiting in line to bump the same reference counter… and perhaps a reference counter wasn’t even a necessary detail in solving the original problem!
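To sketch the problem (this is illustrative, not any particular library’s internals), the hot path of a typical thread-safe reference count is a single shared atomic that every thread must bump:

#include <atomic>

// Every AddRef/Release from every thread is a read-modify-write on the same
// cache line, so under heavy use, N cores serialise on this one counter.
struct ControlBlock
{
  std::atomic<int> refCount{ 1 };
};

void AddRef( ControlBlock& cb )
{
  cb.refCount.fetch_add( 1, std::memory_order_relaxed );
}
void Release( ControlBlock& cb )
{
  if( cb.refCount.fetch_sub( 1, std::memory_order_acq_rel ) == 1 )
  {
    // last reference released -- destroy the object here
  }
}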


Going wait-free

If you do need to reference count objects without causing contention between threads, you can pare back the abstractions of traditional smart pointers and apply some data-oriented design (DOD): replace the shared atomic counter with wait-free, thread-local counters. To demonstrate with traditionally bad example code:

int g_referenceCounts[numThreads][maxObjects] = {}; // one counter per thread, per object -- all initialised to zero

void AddRef( int object_id ) { g_referenceCounts[ThreadId()][object_id]++; }
void DecRef( int object_id ) { g_referenceCounts[ThreadId()][object_id]--; }

Assuming the ThreadId function returns the index of the currently executing thread within the game’s thread pool, each thread will write to its own copy of the reference counter. N.B. these integer counters are signed and may individually go negative; the invariant is that the sum of a counter across every thread is >= 0 at well-defined points in time.
i.e.

void AssertReferenceCountIsValid( int object_id )
{
  // sum this object's counter across every thread's copy
  int count = 0;
  for( int thread = 0; thread != numThreads; ++thread )
    count += g_referenceCounts[thread][object_id];
  assert( count >= 0 );
}

When the sum of an object’s reference counter across all threads == 0, it can be deleted. To lazily evaluate which counters have reached 0, we can modify the DecRef function to record the decremented object’s id in a thread-local/wait-free queue.

// per-thread queues of decremented object ids
int g_checkCounts[numThreads] = {}; // how many ids each thread has queued
int g_checkIds[numThreads][bufferSize]; // bufferSize must bound the DecRef calls per frame

void DecRef( int id )
{
  int thread = ThreadId();
  --g_referenceCounts[thread][id];
  // add the object id to this thread's garbage-check queue
  g_checkIds[thread][g_checkCounts[thread]++] = id;
}

To collect garbage objects, the queues from each thread can be merged and sorted to produce a candidate list of potentially garbage objects.

int CollectGarbage( int* output )
{
  int checkCount = 0;
  // merge the per-thread queues, then sort and de-duplicate the ids
  int* checkIds = Merge(
          g_checkIds, g_checkCounts,
          numThreads, &checkCount );
  checkIds = SortAndRemoveDuplicates(
          checkIds, &checkCount );
  // clear the input queues for the next frame
  for( int t = 0; t != numThreads; ++t )
    g_checkCounts[t] = 0;

  int numGarbage = 0;
  for( int i = 0; i != checkCount; ++i )
  {
    int id = checkIds[i];
    // sum the candidate's counter across every thread's copy
    int refCount = 0;
    for( int t = 0; t != numThreads; ++t )
      refCount += g_referenceCounts[t][id];
    if( refCount == 0 )
      output[numGarbage++] = id;
  }
  return numGarbage;
}
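The Merge and SortAndRemoveDuplicates helpers are left undefined above; as a minimal sketch (assuming a single, sufficiently large scratch buffer and that only one thread collects at a time), they might look like:

#include <algorithm>

// Hypothetical helper sketches -- a real implementation would likely merge
// pre-sorted queues properly rather than concatenate-then-sort.
int g_scratch[numThreads * bufferSize];

int* Merge( int ids[][bufferSize], const int* counts,
            int numQueues, int* totalOut )
{
  // concatenate every thread's queue into the scratch buffer
  int total = 0;
  for( int q = 0; q != numQueues; ++q )
    for( int i = 0; i != counts[q]; ++i )
      g_scratch[total++] = ids[q][i];
  *totalOut = total;
  return g_scratch;
}

int* SortAndRemoveDuplicates( int* ids, int* count )
{
  std::sort( ids, ids + *count );
  *count = (int)( std::unique( ids, ids + *count ) - ids );
  return ids;
}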

Scheduling

As usual, thread scheduling must be handled with care; there must be an appropriate memory fence between using the reference counters in AddRef/DecRef mode and in CollectGarbage mode. At any point in time, either any thread may call AddRef/DecRef, or no thread may call AddRef/DecRef and exactly one thread may call CollectGarbage.
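As a sketch of the strictest schedule (using plain std::thread for brevity rather than the game’s thread pool; DoFrameWork and DestroyObjects are hypothetical), joining the workers provides the required fence before the collector runs:

#include <thread>
#include <vector>

void DoFrameWork( int threadIndex ); // hypothetical: this thread's tasks, which call AddRef/DecRef
void DestroyObjects( const int* ids, int count ); // hypothetical: frees the dead objects

void Frame()
{
  std::vector<std::thread> workers;
  for( int t = 0; t != numThreads; ++t )
    workers.emplace_back( DoFrameWork, t );
  for( auto& w : workers )
    w.join(); // join() synchronises-with each worker's counter writes

  // no thread is calling AddRef/DecRef now, so collection is safe
  static int garbage[maxObjects];
  int numGarbage = CollectGarbage( garbage );
  DestroyObjects( garbage, numGarbage );
}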

Pipelining

In the following diagrams, rows represent threads, the thin rectangles represent garbage collection tasks, and the wide rectangles represent tasks that use AddRef/DecRef.

If you’re used to shared-memory programming, the most basic approach to scheduling may be to use some kind of lock/critical-section to have one thread perform garbage collection at the end of each frame, before starting on the next frame. This causes an undesirable sync-point at the end of each frame, but it is simple to implement.

To pipeline the above, you can introduce some latency, where a frame does not perform its garbage collection until after the next/Nth frame is computed – this requires replacing the example global array with a double/N-buffered array.
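One way to realise that (a sketch, assuming a hypothetical FrameParity() that returns 0 or 1 and flips at each frame boundary) is to buffer each frame’s counter deltas and check queue separately from the persistent totals:

// Hypothetical double-buffered variant: frame N writes buffer N&1 while the
// collector processes buffer (N-1)&1, so writers and the collector never
// share an array.
int g_refDeltas  [2][numThreads][maxObjects] = {}; // per-frame +/- deltas
int g_checkCounts[2][numThreads] = {};
int g_checkIds   [2][numThreads][bufferSize];
int g_totalCounts[maxObjects] = {}; // persistent totals, touched only by the collector

void AddRef( int id ) { g_refDeltas[FrameParity()][ThreadId()][id]++; }
void DecRef( int id )
{
  int b = FrameParity(), t = ThreadId();
  --g_refDeltas[b][t][id];
  g_checkIds[b][t][g_checkCounts[b][t]++] = id;
}

The collector for frame N then folds that buffer’s deltas into g_totalCounts, scans the buffer’s queued ids for totals that have reached zero, and clears the buffer for reuse.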

Alternatively, instead of the different colours/styles in these diagrams representing different frames, they could also represent different engine sub-systems. If, instead of using a simple global array, you create different reference-counter arrays for different sub-systems, then you can interleave the execution of real work and GC work in a similar fashion to the N-buffered frame approach: batching up the GC work, assigning the earliest available thread to perform each sub-system’s/frame’s garbage collection, or interleaving real work with the previous GC work, and so on.

The optimal details of course depend on the threading environment/API you’re using, but the general point is that by applying a bit of DOD to the reference counting problem, you can avoid the cache-contention and thread-blocking problems that arise with the typical thread-shared atomic counter implementation.

Up next

So while we do need to re-invent some details in the process, we can transition our RAII staples such as reference counting quite efficiently to a multi-core environment… but do we need to? Let’s back up for a moment and consider how far we can get when only using ‘the stack’ instead of ‘the heap’, which means no need for reference counting yet.
