Concurrency Testing Using Controlled Schedulers: An Empirical Study
Abstract
We present an independent empirical study on concurrency testing using controlled schedulers. We have gathered 49 buggy concurrent software benchmarks, drawn from public code bases, which we call SCTBench. We applied a modified version of an existing concurrency testing tool to SCTBench, testing five controlled scheduling techniques: depth-first search, preemption bounding, delay bounding, a controlled random scheduler, and probabilistic concurrency testing (PCT). We attempt to answer several research questions, including: Which technique performs best in terms of bug-finding ability? How effective are the two main schedule bounding techniques, preemption bounding and delay bounding, at finding bugs? What challenges are associated with applying concurrency testing techniques to existing code? Can we classify certain benchmarks as trivial or non-trivial? Overall, we found that PCT (with parameter d=3) was the most effective technique in terms of bug finding; it found all the bugs found by the other techniques, plus an additional three, and it missed only one bug. Surprisingly, we found that the “naive” controlled random scheduler, which randomly chooses one thread to execute at each scheduling point, performs well, finding more bugs than preemption bounding and just two fewer bugs than delay bounding. Our findings confirm that delay bounding is superior to preemption bounding and that schedule bounding is superior to an unbounded depth-first search. The majority of bugs in SCTBench can be exposed using a small schedule bound (1-2), supporting previous claims, although one benchmark requires 5 preemptions. We found that the need to remove nondeterminism and control all synchronization (as is required for systematic concurrency testing) can be nontrivial. There were 8 distinct programs that could not easily be included in our study, such as those that perform network and inter-process communication. We report various properties of the benchmarks tested, such as the fact that the bugs in 18 benchmarks were exposed 50% of the time when using random scheduling. We note that future work should not use the benchmarks that we classify as trivial when presenting new techniques, except as a minimum baseline. We have made SCTBench and our tools publicly available for reproducibility and use in future work.
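For intuition, the sketch below illustrates what the abstract calls a “naive” controlled random scheduler: at every scheduling point, one runnable thread is chosen uniformly at random and allowed to take its next step. This is a minimal toy model, not the instrumented tool used in the study; the Task representation, the run_random_schedule helper, and the fixed seed are illustrative assumptions only.

#include <cstdio>
#include <functional>
#include <random>
#include <vector>

// A "thread" is modelled as a fixed sequence of steps; each step ends at a
// scheduling point where the scheduler regains control. (Toy model only.)
struct Task {
    std::vector<std::function<void()>> steps;
    std::size_t next = 0;
    bool done() const { return next == steps.size(); }
};

// Naive controlled random scheduling: at every scheduling point, pick one
// unfinished task uniformly at random and run its next step. A fixed seed
// makes the chosen schedule reproducible.
void run_random_schedule(std::vector<Task>& tasks, unsigned seed) {
    std::mt19937 rng(seed);
    for (;;) {
        std::vector<std::size_t> runnable;
        for (std::size_t i = 0; i < tasks.size(); ++i)
            if (!tasks[i].done()) runnable.push_back(i);
        if (runnable.empty()) break;                       // all tasks finished
        std::uniform_int_distribution<std::size_t> pick(0, runnable.size() - 1);
        Task& t = tasks[runnable[pick(rng)]];
        t.steps[t.next++]();                               // execute one step
    }
}

int main() {
    int shared = 0;
    // Two toy tasks whose interleaving determines what each one observes.
    std::vector<Task> tasks = {
        {{[&] { shared = 1; }, [&] { std::printf("t0 sees %d\n", shared); }}},
        {{[&] { shared = 2; }, [&] { std::printf("t1 sees %d\n", shared); }}},
    };
    run_random_schedule(tasks, /*seed=*/1);                // one random schedule
}

In a bug-finding setting, such a scheduler would typically be run many times with different seeds, stopping the first time a schedule exposes an error; the controlled scheduling ensures that the failing schedule can then be replayed.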