Heap allocation internals, garbage collection, and exam review

Last class of the semester. The exam is one week from today, at the regular class time. It starts at 7:10 PM, closed everything. It shouldn’t be harder than the midterm.

Today we’re going to finish with garbage collection, and review topics for the final exam.

Continuing with heap structure

To review: you can either keep a linked list of free blocks in the heap (as discussed last week), or use a heap pointer.

Heap pointer

The heap keeps a pointer to the “top” of the heap. You allocate space by bumping up the pointer, sort of like a stack, and returning the address of that block. So allocation is very cheap.
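Here’s a minimal sketch of bump allocation in C (the names `heap`, `heap_top`, and `heap_alloc` are made up for illustration):

```c
#include <stddef.h>
#include <stdint.h>

#define HEAP_SIZE (1 << 20)

static uint8_t heap[HEAP_SIZE]; /* the managed heap */
static size_t heap_top = 0;     /* index of the next free byte */

/* Bump allocation: just advance the top pointer. */
void *heap_alloc(size_t size) {
    if (heap_top + size > HEAP_SIZE)
        return NULL;            /* out of memory: a GC would kick in here */
    void *block = &heap[heap_top];
    heap_top += size;
    return block;
}
```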

But what happens when you deallocate something? Likely, whatever you deleted is somewhere in the middle of the used space. You end up with a gap in the middle, and you have to copy all the memory above the gap down by the size of the released block to fill in that space. This is called heap compaction, and it is obviously expensive.
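Continuing the hypothetical sketch above, the compaction step might look like this:

```c
#include <string.h>

/* Slide everything above the freed block down by its size, then
   lower the heap top. Assumes the heap/heap_top sketch above.
   Note: every pointer into the moved region is now stale and
   would have to be updated too, which is part of the expense. */
void heap_compact(size_t off, size_t len) {
    memmove(&heap[off], &heap[off + len], heap_top - (off + len));
    heap_top -= len;
}
```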

So there are tradeoffs in both.

Reclaiming storage

How do you tell, in a programming language, when an object is not being used anymore?

In C and C++ (or even Ada), you have to do it manually: with free() in C and delete in C++. If you get it wrong, you end up with bad bugs. These bugs are often hard to reproduce, because the data may still be there (it doesn’t get overwritten) and so the bug may not show up obviously.
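For example, here’s a classic use-after-free in C that often appears to work, precisely because the freed data hasn’t been overwritten yet:

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int *p = malloc(sizeof *p);
    *p = 42;
    free(p);
    /* Bug: p is dangling. This is undefined behavior, but the old
       value is often still sitting in memory, so this may quietly
       print 42 until the allocator reuses the block. */
    printf("%d\n", *p);
    return 0;
}
```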

Other languages have automatic storage reclamation (also known as garbage collection) that try to tell when storage is no longer needed, and frees that memory.

The garbage collector has two jobs:

  1. Figure out when an object is no longer accessible, and then

  2. Free that memory.

There are two major categories of garbage collectors: mark/sweep and copying garbage collectors on the one hand, and reference-counting collectors on the other.

Mark/sweep and copying garbage collectors

Generally, these garbage collectors watch for the heap to fill up; when it does, they traverse the entire heap and determine which objects are no longer accessible.

The idea is that every live object in the heap is reachable from a variable in a stack frame (or a register, for those who know a bit about processor architecture). Even if an object isn’t accessible directly from a stack variable, it must be reachable through a series of pointer indirections that start at a stack variable. Any object that isn’t reachable this way is dead.
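To make that concrete, here’s a tiny C example of what “reachable” means (the struct and function are invented for illustration):

```c
#include <stdlib.h>

struct node { struct node *next; };

void example(void) {
    /* Live: directly reachable from the stack variable `head`. */
    struct node *head = malloc(sizeof *head);

    /* Also live: reachable via one indirection, head->next. */
    head->next = malloc(sizeof *head->next);
    head->next->next = NULL;

    struct node *tmp = malloc(sizeof *tmp);
    tmp->next = NULL;
    tmp = head; /* the object tmp used to point at is now unreachable: dead */
}
```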

Both these types of garbage collectors traverse the graph of live objects, starting at all the existing stack frames. Where they differ is in what they do during the traversal.

Mark/sweep GC

This collector has two phases: the mark phase and the sweep phase. For every object, there’s an extra bit in the object that tells you whether it’s live or not. Initially, all these objects have a 0 mark bit.

When the heap is full, it starts traversing the live graph, marking all the objects that are live with a 1. Then, it moves on to the sweep phase: it loops through the whole heap, leaving all the objects with a mark bit of 1 alone, and putting all objects with a mark bit of 0 on the free list.
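Here’s a minimal sketch of the two phases in C, assuming each object carries a mark bit and knows its pointer fields; the heap-iteration and free-list helpers are assumptions, not a real API:

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct object {
    bool mark;               /* the extra bit: 1 = live */
    size_t num_fields;
    struct object *fields[]; /* pointers to other heap objects */
} object;

/* Assumed helpers, not defined here. */
object *first_heap_object(void);
object *next_heap_object(object *obj);
void free_list_put(object *obj);

/* Mark phase: mark everything reachable from a root. */
void mark(object *obj) {
    if (obj == NULL || obj->mark)
        return;              /* already visited: handles sharing and cycles */
    obj->mark = true;
    for (size_t i = 0; i < obj->num_fields; i++)
        mark(obj->fields[i]);
}

/* Sweep phase: walk the whole heap, reclaiming unmarked objects
   and clearing mark bits for the next collection. */
void sweep(void) {
    for (object *o = first_heap_object(); o != NULL; o = next_heap_object(o)) {
        if (o->mark)
            o->mark = false;  /* live: reset the bit for next time */
        else
            free_list_put(o); /* dead: back on the free list */
    }
}
```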

It’s a lot harder to do this with a heap pointer, since you have to do a lot of compacting. So a free-list implementation is more typical.

Let’s look at efficiency.

The mark phase traverses the live objects, so the cost is proportional to the number of live objects in the heap, O(L). Usually this is about 1/3 of the heap.

The sweep phase has to traverse the entire heap, O(H). And since H ≥ L by definition, the total cost ends up being O(H).

Note that this kind of garbage collection stops the program from running, which makes it bad for real-time interactive systems.

There’s also the downside of using a free list: allocations are expensive, and it’s sometimes hard to find a chunk of memory that fits the requested size.

Copying GC

This is what ML and Scheme use.

The basic idea is that there are two heaps: the From heap and the To heap.

When you allocate objects, they go into the From heap. Once it’s full, the program stops, and the GC traverses the graph in the same way the mark phase of the mark/sweep collector does.

Instead of marking objects, though, the GC copies every single live object it finds into the To heap.

When it does that, it leaves a forwarding pointer in each object in the From heap pointing to the new copy of the object, for use during the copying process.

Internal pointers within the heap are also updated to point to the new locations in the To heap. Afterwards, the GC sweeps through the stack, updating all the pointers to the new locations.

Then, the From and To heaps are swapped.
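Here’s a sketch of the copy step in C, written depth-first for clarity (real collectors, like Cheney’s algorithm, usually do this breadth-first with a scan pointer); all of the names here are invented:

```c
#include <stddef.h>
#include <string.h>

typedef struct object {
    struct object *forward;  /* forwarding pointer; NULL if not yet copied */
    size_t size;             /* total size of this object, in bytes */
    size_t num_fields;
    struct object *fields[]; /* pointers to other heap objects */
} object;

static char *to_top;         /* bump pointer into the To heap */

/* Copy one object into the To heap (if it hasn't been copied already)
   and return its new address. */
object *copy(object *obj) {
    if (obj == NULL)
        return NULL;
    if (obj->forward != NULL)
        return obj->forward;           /* already moved: follow the forward */

    object *new_obj = (object *)to_top;
    to_top += obj->size;
    memcpy(new_obj, obj, obj->size);
    obj->forward = new_obj;            /* leave a forwarding pointer behind */

    /* Fix the internal pointers by copying each field in turn. */
    for (size_t i = 0; i < new_obj->num_fields; i++)
        new_obj->fields[i] = copy(new_obj->fields[i]);
    return new_obj;
}
```

The stack sweep mentioned above then just replaces each root r with copy(r).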

The benefit of this is that it automatically compacts the objects as they’re copied, making allocation very easy with a heap pointer. It also means that the cost of garbage collection is only proportional to the size of the live objects, O(L).

Does this cause problems because your heap is half the size? Not necessarily. Usually the OS swaps out the To heap to disk and doesn’t touch it much. In fact, often the To heap doesn’t even have to exist while the garbage collection process isn’t running.

But, it still causes problems with real-time applications, because it still has to stop the program and copy everything.

Reference counting

The previous garbage collectors all stop the world and traverse the live objects. Reference counting systems are different.

Each object keeps a count of how many pointers are referencing it. When a new pointer is made to point at the object, the reference count is incremented. When such a pointer is removed, the reference count is decremented. When the reference count hits 0, the object immediately gets put on the free list.
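A minimal reference-counting sketch in C (rc_retain/rc_release are invented names; real systems hide these behind pointer assignment):

```c
#include <stdlib.h>

typedef struct rc_object {
    size_t refcount;
    /* ... payload fields ... */
} rc_object;

/* Called whenever a new pointer is made to point at obj. */
void rc_retain(rc_object *obj) {
    if (obj != NULL)
        obj->refcount++;
}

/* Called whenever a pointer to obj is removed or overwritten.
   A real release would also rc_release every pointer field of obj,
   which is why freeing a long linked list can be slow. */
void rc_release(rc_object *obj) {
    if (obj != NULL && --obj->refcount == 0)
        free(obj);           /* count hit 0: reclaim immediately */
}
```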

It’s harder to talk about the cost of reference counting, because the cost is “spread out” in every pointer operation: each pointer creation, destruction, and modification gets a little bit slower. But the benefit of this is that it works a lot better for real-time programs, because it doesn’t “stop the world”. It’s not foolproof—deleting large data structures like a long linked list can still be pretty slow—but it’s better than the ones we looked at before.

But there’s one big problem with reference counting: cycles. Consider a linked list with a loop, where the last element points back to the first: the reference count of the first element will be 2. Then, even if the stack variable pointing to it is deleted, every element in the list will still have a count of 1, and the whole structure will stick around.
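Here’s that leak in miniature, written against the hypothetical counting scheme above:

```c
#include <stdlib.h>

typedef struct node {
    size_t refcount;
    struct node *next;
} node;

static node *make_node(void) {
    node *n = malloc(sizeof *n);
    n->refcount = 1;             /* held by the caller's variable */
    n->next = NULL;
    return n;
}

void leak_a_cycle(void) {
    node *a = make_node();       /* count(a) = 1 */
    node *b = make_node();       /* count(b) = 1 */
    a->next = b; b->refcount++;  /* count(b) = 2 */
    b->next = a; a->refcount++;  /* count(a) = 2: the cycle closes */

    /* Drop the stack references: */
    a->refcount--;
    b->refcount--;
    /* Both counts are now 1, never 0, so neither node is ever freed. */
}
```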

To fix this problem, a lot of reference-counted languages have a mark/sweep or copying garbage collector as a backup.

In a pure functional language, this isn’t a problem, because creating a cycle requires mutating an older object to point to a newer one, and pure languages don’t allow mutation.

There’s one more type of GC to talk about, the one that’s actually used in Scheme, ML, etc.

Generational copying GC

This is just like the copying GC, but it has more than two heaps.

You start with a large heap, for “generation 0”, where objects are allocated. When this fills up, you copy all live objects from the gen 0 heap to the next heap, gen 1. When gen 1 fills up, you copy every live object into gen 2, and so on.

The idea behind this is that the longer an object has been alive, the longer it’s likely to stay alive. Meanwhile, a lot of small objects are allocated, used once, and then thrown away. Objects that stay alive for a long time end up in the higher generations.

Since gen 0 has a small proportion of live objects, you end up with cheaper copying garbage collection.

The only complication is that in a non-pure-functional language, you may end up with pointers between the generations that you then have to keep track of.
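Here’s a hedged sketch of the promotion logic, assuming NUM_GENS copying heaps and a copy_live_objects routine like the copy step above (every name here is invented):

```c
#define NUM_GENS 3

typedef struct heap heap;    /* an opaque copying heap */

/* Assumed helpers, not defined here. */
int heap_is_full(heap *h);
void copy_live_objects(heap *from, heap *to);
void heap_reset(heap *h);
void full_copying_gc(heap *h);

heap *gens[NUM_GENS];        /* gens[0] is where allocation happens */

/* Collect generation g by promoting its survivors one level up.
   (A real collector also needs a "remembered set" to find the
   inter-generational pointers mentioned above.) */
void collect(int g) {
    if (g + 1 < NUM_GENS) {
        if (heap_is_full(gens[g + 1]))
            collect(g + 1);                      /* make room one level up */
        copy_live_objects(gens[g], gens[g + 1]); /* promote survivors */
        heap_reset(gens[g]);                     /* generation g is now empty */
    } else {
        full_copying_gc(gens[g]);                /* oldest gen: ordinary two-space GC */
    }
}
```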

And that’s all she wrote.

Final exam

The final exam is cumulative, but at least 70% of the exam will be on topics after the midterm.

The best way to get a sense of the topics is to look through the topics listed for each week of the course. There isn’t going to be a lot of Ada or Scheme programming, since that was on the midterm. Be sure you know your recursive thinking, though, and be familiar with the later languages.

Topics on the exam can include (note that this is an incomplete list):

  • Everything from the midterm
  • λ calculus
    • Conversions
    • Y combinator (don’t memorize what the Y combinator looks like, but know how to use it)
    • Reduction order
    • Church-Rosser theorems
  • ML
    • Know generally how to write ML code
    • Pattern matching
    • Types
    • Parametric polymorphism
    • Know how to infer types like the compiler does
    • Constraints on the language, e.g. homogeneous lists
  • OO languages
    • Definition of OO, what makes a language OO
    • How subtyping is expressed in Java and Scala
    • Subset interpretation of subtyping (inc. for functions and generics)
    • Function subtyping (know, and be able to justify, contra- vs. covariant subtyping)
    • Java
      • Know generally how to define an ordinary class, and generic classes
      • Emphasis will be on generics
    • Scala
      • Case classes
      • Function subtyping
      • Scala generics
      • Covariant and contravariant generics
      • And general Scala code
    • If you did the assignments well, you should be fine
  • Storage allocation
    • Free list vs. heap pointers
    • Different types of garbage collectors