Software development final exam answers: Part 2

These are the answers, along with some commentary, for part 2 (Computer Architecture and Operating Systems) of my software development final exam.

Each question was graded out of 5, with points awarded for different parts of each question as described below; in some cases I awarded partial marks (e.g., 2 of 3 possible marks for part of a question) when someone had the right general idea but missed some details.

Computer Architecture and Operating Systems

Question: Name and describe the four states in MESI cache coherence.
Answer: The four states are "Modified", "Exclusive", "Shared", and "Invalid". In the "Modified" state, a cache line holds data which has been modified and must be written back to main memory before the cache line is invalidated or the memory location is read by another cache; no other caches may hold the same data. In the "Exclusive" state, the data matches what is stored in main memory and it is guaranteed that no other caches hold copies of the same data; this cache line can transition into the Modified state without any bus traffic. In the "Shared" state, the data matches what is stored in main memory but it is possible that other caches hold copies of the same data; if the cache line is written to, it must be first invalidated in other caches. In the "Invalid" state, the cache line is not valid and its contents is ignored (this is the state all cache lines start in when the CPU is first powered up).
Grading: I awarded 1 mark for getting the names right and 1 mark for each of the states correctly described (even if their names were wrong or absent).
Commentary: The average score on this question was 0.9/5, which I found disappointingly low. 75% of respondents received a score of zero, in most cases due to not even attempting the question.
A few people answered the question without any apparent hesistation, but the majority of those who responded clearly did not know the answer but instead figured it out. This is what I was hoping for, and why I phrased the question as I did: Saying "the four states in MESI cache coherence" was a cue (which several people picked up on) that people should be looking for states with names starting with the letters "M", "E", "S", and "I".
This question was, more than anything else, a test of whether people understood caches and cache coherence enough to invent a cache coherence protocol — because if you try, you end up with something similar to MESI pretty quickly. Understanding caches and coherence is very important for code performance: False sharing, for example, can cause a dramatic increase in bus traffic, and corresponding reduction in performance, if two separate threads access memory locations which are close to each other.
Question: Which of the following two functions is faster, and why? Assume that the C code is translated into machine code in the obvious manner, without optimization.
```
void zeroarray1(double * A)
{
        int i, j;

        for (i = 0; i < 1024; i++)
                for (j = 0; j < 1024; j++)
                        A[i * 1024 + j] = 0.0;
}

void zeroarray2(double * A)
{
        int i, j;

        for (j = 0; j < 1024; j++)
                for (i = 0; i < 1024; i++)
                        A[i * 1024 + j] = 0.0;
}
```
Answer: The function zeroarray1 is faster, because it accesses memory in sequential locations; this both reduces bus traffic, since each cache line is only loaded and stored once (in zeroarray2, the first cache line will almost certainly be evicted before we come around to j=1, i=0 and try to access it again) and on modern CPUs allows for automatic prefetching to occur.
Grading: I awarded 2 marks for getting the right answer; 2 marks for pointing out that it was because sequential memory accesses were better for caching; and 1 mark for giving a reason why sequential memory accesses helped for caching.
Commentary: The average score on this question was 4.5/5; about 75% of people scored 5/5, and a large number of people scored 4/5 and could probably have made it 5/5 by writing a few more words. (A few of the people scoring 4/5, however, admitted that they knew that sequential memory accesses were better for caches, but weren't sure why.)
To quote Phil Karlton, there are only two hard things in Computer Science: cache invalidation and naming things. (Some people add "and off-by-one errors!" to this.) I didn't ask any questions about naming things on this exam, but understanding caches is absolutely vital, and I couldn't let this test go without having one straightforward cache question.
Despite the high marks on this question, it revealed a surprising amount of confusion: Some people imagined cache lines as small as 4 bytes, while others thought the CPU cached entire 4 kB pages at a time. (In fact, there is a cache which deals with pages, but that's the TLB — used for paging table lookups — not the data cache.) Conversely, the "double" type — which I picked on the assumption that as an IEEE 754 standard data type, it would be widely understood — was believed to be as small as 1 byte in size and as large as 16 bytes. Nevertheless, since these were not relevant to the key point of the question — memory access patterns — I did not deduct marks for them.
Question: Explain the difference between a mutex and a read-write lock.
Answer: A mutex, or MUTual EXclusion lock, can only be "owned" by a single thread at once; if another thread wants the lock, it has to wait until the first thread releases it. A read-write lock — also known as a shared-exclusive lock, leading on occasion to the abbreviation "sex lock" — can be locked either for reading or for writing. If a thread holds a read-write lock for reading, then other threads can also obtain read locks; but if locked for writing, no locks (either reading or writing) can be obtained by other threads.
Grading: I awarded 2 marks for describing a mutex and 3 marks for describing a read-write lock.
Commentary: The average score on this question was 3.7/5; about two thirds of people scored 5/5 and a large minority of the rest scored 2/5.
Locking is an essential issue to understand, not just for operating system development, but also for any sort of multi-threaded programming, and understanding that there are multiple types of locking primitives is a very good thing. It's always possible to do without read-write locks — if nothing else, you can synthesize them out of mutexes — but being aware of them as a standard tool makes it far more likely that someone will recognize when they should be used.
Question: Why does the pthread_cond_wait function (wait on a condition variable) take a locked mutex as a parameter?
Answer: The function atomically unlocks the mutex and puts the thread to sleep in order to avoid a race between the "waiting" thread and the "signalling" thread: Without the mutex, the "signalling" thread might send a wakeup signal after the "sleeping" thread checks the condition but before it goes to sleep; this would result in the wakeup signal getting lost (since nobody was listening for it) and the "sleeping" thread potentially sleeping indefinitely as a result.
Grading: This was simple: Either you got it, for 5 marks, or you didn't. To get the marks, I was looking for some understanding of exactly what would go wrong without the mutex.
Commentary: The average score on this question was 1.2/5; in other words, about a quarter of people got it.
This question was exceptional for the range of wrong answers provided; some of the repeated responses were that the mutex prevented multiple threads from sleeping on the condition variable at the same time (in fact, it's quite common for multiple "worker" threads to sleep on the same condition variable); that the mutex ensured that only one thread would wake up at once (in fact, POSIX provides separate pthread_cond_signal and pthread_cond_broadcast functions which wake up either one or all the sleeping threads); and that the ownership of the mutex would be transferred from the signalling thread to the sleeping thread as an optimization to avoid an unlock/lock pair (a clever idea, but not true; the wakeup functions don't take the mutex as a parameter).
I think this was the most important question on this part of the exam: If you don't understand caches you will end up writing slow code, but if you don't understand race conditions, you will can end up writing code which is broken — and not just broken in obvious ways, but broken in horribly subtle ways which will only bite you once in a blue moon and/or when you're getting the most traffic in your website's life and really really need for things to not break.
And before anyone objects that this is only relevant to multi-threaded programming: Race conditions are also responsible for many security bugs where a program races against an attacker, and come up very often when dealing with distributed systems — and sometimes the two overlap, as with a vulnerability which recently allowed over a million dollars to be stolen via ATMs.
Question: What is a TLB? When and why is it most often (completely) flushed?
Answer: A TLB is a Translation Lookaside (or Lookahead) Buffer; a cache of paging table entries. It is most often flushed when the operating system hands ownership of the CPU over to a new process, since different processes (usually) each have their own paging tables, and thus the values cached in the TLB are no longer correct copies of the paging tables in use.
Grading: I awarded 2 marks for describing what the TLB did (acronym expansion not required) and 3 marks for knowing that it was usually flushed during a context switch.
Commentary: The average score on this question was 2.9/5; over half of respondents received full marks.
The TLB is rarely mentioned in discussions about caching, but it is vitally important on modern systems. While I was looking for — and people provided — a discussion of data caches in question 2, I realized later that on many systems using standard 4 kB pages, the largest part of the penalty faced by zeroarray2 would be that each access would incur a TLB miss — not just requiring data to be read from main memory, but also requiring paging tables to be read and parsed.
The importance of TLBs to performance notwithstanding, it wasn't the main reason I included this question; rather, I decided to use it as a test for general understanding of virtual memory: If you know that paging tables are used to translate virtual addresses to physical addresses and that different processes have different paging tables, you've at least got the basics in place. Indeed, an earlier draft of this exam had the question "explain the difference between virtual and physical memory" — but I decided to replace it because it was too vague and likely to be hard to mark.

For this entire part, the average combined score was 13.2/25; approximately equal numbers of people received scores in each of the ranges 0-9, 10-12, 13-17, and 18-25.

I will be posting my answers, along with commentary, to part 3 of the exam — Mathematics — shortly after I finish marking the responses I have received to it.

Posted at 2012-10-31 09:00 | Permanent link | Comments

Daemonic Dispatches

Software development final exam answers: Part 2

Computer Architecture and Operating Systems

Recent posts

Monthly Archives

Yearly Archives