1. Cloud Comparison: How can we tell which cloud is best?
Amazon, Google, Microsoft, or others? In this work, you will compare the performance and cost/performance of numerous different cloud services and help build a benchmark suite to do so. You might start with this. Some interesting reading might be this.

2. Tiered Parity: Several trends in the past decade have made protecting against data loss and unavailability much more challenging than ever before. One, the total quantity of data being stored is growing much more quickly than the capacity of individual storage devices. Two, even though device capacity improvements lag data quantity growth, they still outpace device performance improvements. The result of these trends is that many more devices are needed with each passing year and that the time to recover their lost data is also increasing. This is why RAID-6, which replaced RAID-5, has now given way to erasure coding, and why dedicated parity groups are being replaced with declustered parity. However, declustered parity, which is needed at medium scale, has gaping flaws at hyper-scale. One solution might be tiered parity schemes, which create medium-sized pools of declustered inner parity and then larger pools of outer parity across them. Studying tiered parity is an important step toward ensuring that hyper-scale cloud providers as well as scientific super-computing centers can continue to store ever-increasing amounts of data without loss while minimizing overheads. This study will entail surveys of related research, discussions with industry and national lab researchers, development of mathematical models and/or system simulations, and presentations of results.

3. Science of Scalability: Scalability is a key feature of distributed systems, yet what does it really mean? In this project, you'll develop techniques to study how various systems scale. What are the key limitations on scale, and how can you evaluate a particular system's ability to overcome these limitations?
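One way to make "scalability" concrete is to fit measured throughput to a simple analytical model, such as the Universal Scalability Law. The sketch below uses made-up parameter values in place of real measurements, purely to illustrate the shape of the question:

```python
# Sketch: the Universal Scalability Law models throughput at n nodes as
#   X(n) = (lam * n) / (1 + sigma*(n-1) + kappa*n*(n-1))
# where sigma captures contention (serialization) and kappa captures
# crosstalk (pairwise coordination). The parameter values below are
# illustrative assumptions, not measurements of any real system.

def usl_throughput(n, lam=1000.0, sigma=0.05, kappa=0.001):
    """Predicted throughput (ops/sec) at n nodes."""
    return (lam * n) / (1 + sigma * (n - 1) + kappa * n * (n - 1))

if __name__ == "__main__":
    # Throughput rises, flattens, then retrogrades as crosstalk dominates.
    for n in [1, 2, 4, 8, 16, 32, 64, 128]:
        print(f"{n:4d} nodes: {usl_throughput(n):10.1f} ops/sec")
    # The peak marks this (hypothetical) system's useful scale limit.
    peak = max(range(1, 1025), key=usl_throughput)
    print("peak at", peak, "nodes")
```

Fitting sigma and kappa to real measurements, rather than assuming them as here, is one way to quantify how much contention versus coordination limits a given system's scale.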
There are a number of ways to go on this: build a simulation platform, or figure out how to run real code on some kind of emulator to stress test it.

4. User-level Logging Alternatives: As we saw in the ALICE paper, there are a number of different ways to implement local update protocols. In this work, you will study the range of different approaches used across systems, and try to classify them into a taxonomy. What are the common techniques? Can a general approach be developed and plugged in underneath a number of different systems, without losing performance? You will start with a survey of how things are built, and then perhaps try building a generic library that can serve as a substitute, thus making high-performance, correct crash consistency achievable for all.

5. OpenLambda: We are building a new microservice platform called OpenLambda (read this for more info). In this project, you will work on some aspect of OpenLambda. Some ideas include: faster container startup, low-latency database support, measurement and analysis of other platforms (OpenWhisk), or, perhaps best of all, building a number of different services atop existing microservices architectures to learn what is truly important.

6. Distributed System Performance Analysis: Distributed systems are complicated and hard to debug when it comes to performance problems. In this work, you'll inject monitoring into a particular distributed system (or two) and use it to try to understand different performance problems that arise. One general idea that could be put to use here is to monitor any/all queues in the system; can queue lengths be readily used to diagnose performance problems, or, better yet, suggest solutions to them?

7. The New PC Era -- Personal Clouds: In this project, you'll make it as easy to use the cloud as it is to run a python script on your local machine.
The PC project will have you build a system to enable the easy launch of python/whatever scripts on a cloud service; problems you'll have to solve include how to securely launch a job, how to access data, and how to manage cloud resources within a given budget. This will be a fun building project, if you like that sort of thing.

8. Importance of Low-Overhead Communication/Storage: In this project, you'll study existing distributed systems and try to answer this question: how fast does the network really need to be? You'll base your work on the idea found in this paper and extend it to modern systems such as MongoDB, Cassandra, etc. How important is having a really high-performance network? The same question, it should be noted, can be asked of the storage layer.

9. Surveying Modern Hardware Failure: How does modern hardware fail? In this project, you'll survey the literature to create a model of how modern hardware fails, including memories, SSDs, disks, networks, and other hardware parts. You'll first read all relevant literature, and then try to capture in the simplest possible terms how faults arise in modern systems. The end goal would be to produce a journal paper that summarizes all previous work in this space.

10. SSD Fault Simulation: We read about how SSDs fail in class, and some theories of why the very unusual bathtub-shaped failure rates arise. In this project, you'll build a simulator of SSDs that includes localized failure behavior, and see if you can replicate the type of behavior seen in the SSD failure paper. What are the important parameters? How does such failure (and the remapping needed in the FTL) affect performance? Does the theory of SSD failure found in that paper match what you can produce via simulation?

11. Abort Conditions and Failure Handling in 2PC: In this work, you'll evaluate two-phase commit in live systems. First, find a system or two that use 2PC to perform distributed transactions (e.g., PostgreSQL, others?).
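Before digging into a real codebase, it may help to fix terminology with a toy model of the protocol's vote/abort logic. This is a minimal sketch with hypothetical names, not any real system's API:

```python
# Toy two-phase commit: the coordinator commits only if every participant
# votes yes in the prepare phase; a single abort vote (e.g., caused by a
# failed write() or malloc() on one node) aborts the whole transaction.
# All names here are hypothetical; real systems add logging and timeouts.

def participant_prepare(can_persist: bool) -> str:
    """A participant votes yes only if it could durably log the prepare."""
    return "yes" if can_persist else "no"

def coordinator_commit(votes: list) -> str:
    """Phase 2 decision: unanimous yes => commit, otherwise abort."""
    return "commit" if all(v == "yes" for v in votes) else "abort"

if __name__ == "__main__":
    healthy = [participant_prepare(True) for _ in range(3)]
    print(coordinator_commit(healthy))            # commit
    one_disk_failure = healthy[:2] + [participant_prepare(False)]
    print(coordinator_commit(one_disk_failure))   # abort
```

In real implementations, each yes vote also requires durably logging a prepare record, and timeouts turn a silent participant into an abort; those are exactly the code paths worth tracing.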
Then, start to trace through how it works, in particular focusing on when aborts arise and, in general, what failure cases are handled. What conditions cause an abort vote from one node? Fault injection could be useful here; perhaps you could insert disk failures (write() returns an error) and memory-allocation failures (malloc() returns null) on one node to see what happens during distributed transaction commit.

12. Faults and Divergence in Replicated State Machines: One hard problem in RSMs is ensuring no divergence; even if replicated servers receive the same inputs, there is no guarantee they will behave the same, so careful programming is required. In this project, you will inject faults into some real replicated services and see if you can get the replicas to diverge (to make different decisions and thus lead to observably different behavior). What happens when a disk write fails? When a memory allocation fails? What if memory becomes corrupt? And so on.

13. Hyper-scale Simulation: It is incredibly challenging to run and test systems at scale. In this project, you'll solve that problem by building a simulator that can mimic the behavior of various real distributed systems at scale. Start with a real system (e.g., the Google File System) and build pieces of it in your simulator, making it as detailed as possible. Then, scale up disks, machines, clients, network, etc., and see how the system behaves. You would likely have to think about how to induce different types of failures to see really interesting behaviors (e.g., disk failures lead to background data migration, which can have a cost on foreground performance).

14. A Theory of Tail Latency and Predictable Local Storage Systems: Tail latency has become a central focus in the performance of large-scale systems. The question is not what the average response time is, but rather what the 99th percentile of requests will see.
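Computing such a percentile from raw latency samples is straightforward; here is a minimal sketch, with a synthetic workload standing in for real file-system measurements:

```python
# Sketch: compute mean and 99th-percentile latency from raw samples.
# The workload below is hypothetical: 99% fast requests plus a 1% slow
# path, mimicking the heavy tail a real measurement might reveal.

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = min(len(ordered) - 1, int(p * len(ordered)))
    return ordered[rank]

if __name__ == "__main__":
    lat = [1.0] * 9900 + [50.0] * 100   # 1% of requests hit a 50ms slow path
    mean = sum(lat) / len(lat)
    print(f"mean={mean:.2f}ms p99={percentile(lat, 0.99):.2f}ms")
    # -> mean=1.49ms p99=50.00ms: the tail, not the mean, tells the story
```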
Clearly, the closer 99th-percentile behavior is to the average, the more predictable the behavior of your system is. In this project, you'll start by measuring latencies of local file systems (the key building blocks in distributed storage) to understand what kinds of latency profiles are common. What functionality in the file system can lead to different observed tail performance? (Think about reading/writing, caching with different-sized memories, path hierarchy depth, lots of files in a directory, and other functionality that could affect performance.) If you find some interesting problems, you could then take the next step and start to build a more predictable local storage system; how can you make it such that the local storage system is a highly predictable building block for larger-scale systems?

15. Stressing the Maintenance System: Scalable storage systems have to perform various amounts of maintenance to ensure that the data within them remains safe. For example, HDFS has to scan data in the background to ensure the right number of replicas are available, and then make more copies as need be when nodes are down. However, it is challenging to build such maintenance robustly; if too many nodes are down, the system will start replicating data at too high a rate (hurting performance) or not quickly enough (hurting availability). In this project, you'll take existing systems and build tools to stress their background maintenance activities, in order to understand how they work and what their limits are. If you have time, you can even improve existing systems by making them more robust to a range of failure behaviors.

16. The Science of Distributed Evaluation: Many papers produce results on how various distributed systems perform; how reproducible are said results? In this project, you'll take on the task of comparing the performance of modern key-value storage systems to see if what the papers say matches your own reality.
We'll provide some papers to start with, and we'll proceed by evaluating the evaluations themselves.
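One possible starting point for such re-evaluation is a tiny benchmark harness with a uniform get/put interface. The sketch below times an in-memory stub, which merely stands in for a real store's client; none of the names are from an actual driver:

```python
import time

# Sketch of a benchmark harness for key-value stores. DictStore is a
# stand-in for a real client (e.g., a MongoDB or Cassandra driver); to
# re-evaluate a paper, swap a real client in behind the same interface.

class DictStore:
    """In-memory stand-in exposing the get/put interface a store offers."""
    def __init__(self):
        self.data = {}
    def put(self, key, value):
        self.data[key] = value
    def get(self, key):
        return self.data.get(key)

def run_benchmark(store, n_ops=10_000):
    """Time n_ops puts then n_ops gets; return (puts/sec, gets/sec)."""
    t0 = time.perf_counter()
    for i in range(n_ops):
        store.put(f"key{i}", i)
    t1 = time.perf_counter()
    for i in range(n_ops):
        store.get(f"key{i}")
    t2 = time.perf_counter()
    return n_ops / (t1 - t0), n_ops / (t2 - t1)

if __name__ == "__main__":
    puts, gets = run_benchmark(DictStore())
    print(f"puts/sec={puts:,.0f} gets/sec={gets:,.0f}")
```

Holding the interface fixed while varying the backend, key sizes, and access distributions is what makes apples-to-apples comparison with the published numbers possible.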