
1. Cloud Comparison: How can we tell which cloud is best? Amazon, Google, Microsoft, or others? In this work, you will compare the performance and cost/performance of numerous different cloud services and help build a benchmark suite to do so. You might start with this. Some interesting reading might be this.
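As a flavor of what a cross-cloud benchmark harness for this project might report, here is a minimal Python sketch that times a stand-in compute kernel and reports both throughput and throughput-per-dollar; the kernel and the dollar-per-hour figure are placeholders you would replace with real workloads and each provider's real pricing.

    # Tiny cross-cloud benchmark sketch: run the same kernel on each provider's VM
    # type and compare throughput and throughput-per-dollar. Values are placeholders.
    import time

    def kernel(n=2_000_000):
        # stand-in workload; a real suite would cover CPU, memory, disk, and network
        return sum(i * i for i in range(n))

    def benchmark(price_per_hour):
        start = time.perf_counter()
        kernel()
        elapsed = time.perf_counter() - start
        runs_per_sec = 1.0 / elapsed
        return runs_per_sec, runs_per_sec / price_per_hour

    if __name__ == "__main__":
        perf, perf_per_dollar = benchmark(price_per_hour=0.10)   # placeholder price
        print(f"runs/sec: {perf:.2f}  runs/sec per $/hr: {perf_per_dollar:.2f}")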
2. Tiered Parity: Several trends in the past decade have made protecting against data loss and unavailability more challenging than ever before. One, the total quantity of data being stored is growing much more quickly than the capacity of individual storage devices. Two, even though device capacity improvements lag data quantity growth, they are still greater than device performance improvements. The result of these trends is that many more devices are needed with each passing year and that the time to recover their lost data is also increasing. This is why RAID-6, which replaced RAID-5, has now given way to erasure coding and why dedicated parity groups are being replaced with declustered parity. However, declustered parity, which is needed at medium scale, has gaping flaws at hyper-scale. One solution might be tiered parity schemes, which create medium-sized pools of declustered inner parity and then larger pools of outer parity across them. Studying tiered parity is an important step toward ensuring that hyper-scale cloud providers as well as scientific supercomputing centers can continue to store ever-increasing amounts of data without loss while minimizing overheads. This study will entail surveys of related research, discussions with industry and national-lab researchers, development of mathematical models and/or system simulations, and presentations of results.
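As one small example of the kind of mathematical model this project might develop, here is a minimal Python sketch comparing the raw-capacity overhead of a flat k+m erasure code against a hypothetical two-tier scheme (inner declustered parity pools plus an outer parity layer across pools); the code parameters are invented illustrative values, not from any real system.

    # Hypothetical model: capacity overhead of flat vs. tiered parity.
    # All parameters below are illustrative assumptions, not real system values.

    def flat_overhead(k, m):
        """Overhead of a single k+m erasure code: extra raw bytes per user byte."""
        return m / k

    def tiered_overhead(k_in, m_in, k_out, m_out):
        """Two-tier scheme: each inner pool stores k_in data + m_in parity chunks,
        and an outer code adds m_out parity pools for every k_out inner pools."""
        inner = (k_in + m_in) / k_in          # raw bytes per user byte after inner parity
        outer = (k_out + m_out) / k_out       # expansion added by the outer layer
        return inner * outer - 1.0

    if __name__ == "__main__":
        print("flat 10+4 overhead:    %.2f" % flat_overhead(10, 4))
        print("tiered (10+2, 20+1):   %.2f" % tiered_overhead(10, 2, 20, 1))

A fuller model would add durability (probability of data loss given device failure and rebuild rates), which is where the interesting trade-off between the two schemes actually shows up.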
3. Science of Scalability: Scalability is a key feature of distributed systems, yet what does it really mean? In this project, you'll develop techniques to study how various systems scale. What are the key limitations on scale, and how can you evaluate a particular system's ability to overcome these limitations? There are a number of ways to go with this: build a simulation platform, or figure out how to run real code on some kind of emulator to stress test it.
4. User-level Logging Alternatives: As we saw in the ALICE paper, there are a number of different ways to implement local update protocols. In this work, you will study the range of different approaches used across systems, and try to classify them into a taxonomy. What are the common techniques? Can a general approach be developed and plugged in underneath a number of different systems, without losing performance? You will start with a survey of how things are built, and then perhaps try building a generic library that can serve as a substitute, thus making high-performance, correct crash consistency achievable for all.
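To make "local update protocol" concrete, here is a minimal sketch of one common technique: write-ahead logging followed by an in-place update, with explicit fsync ordering points. The file names and record format are invented for illustration; real systems (and the protocols studied in ALICE) differ in many details such as checksums, batching, and directory fsyncs.

    # Minimal write-ahead-log update sketch (illustrative file names and format).
    # Assumes DATA already exists; error handling and log replay are omitted.
    import os

    LOG = "journal.log"
    DATA = "data.db"

    def logged_update(offset: int, payload: bytes):
        # 1. Append an intent record to the log and force it to disk.
        with open(LOG, "ab") as log:
            record = (offset.to_bytes(8, "little")
                      + len(payload).to_bytes(4, "little")
                      + payload)
            log.write(record)
            log.flush()
            os.fsync(log.fileno())       # ordering point: log must be durable first
        # 2. Apply the update in place; a crash here is recoverable by log replay.
        with open(DATA, "r+b") as data:
            data.seek(offset)
            data.write(payload)
            data.flush()
            os.fsync(data.fileno())
        # 3. Truncate the log once the in-place update is durable.
        open(LOG, "wb").close()

A taxonomy for this project would place variations of this pattern (redo vs. undo logging, shadow copies, soft updates, ordered journaling) along axes such as write amplification, fsync count per update, and recovery cost.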
5. OpenLambda: We are building a new microservice platform called OpenLambda (read this for more info). In this project, you will work on some aspect of OpenLambda. Some ideas include: faster container startup, low-latency database support, measurement and analysis of other platforms (OpenWhisk), or, perhaps best of all, building a number of different services atop existing microservice architectures to learn what is truly important.
6. Distributed System Performance Analysis: Distributed systems are complicated and hard to debug when it comes to performance problems. In this work, you'll inject monitoring into a particular distributed system (or two) and use it to try to understand different performance problems that arise. One general idea that could be put to use here is to monitor any/all queues in the system; can queue lengths be readily used to diagnose performance problems, or, better yet, suggest solutions to them?
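As a small illustration of the queue-monitoring idea, here is a hedged Python sketch of an instrumented queue wrapper that periodically samples its own length; the class name and sampling interval are invented for illustration, and in a real system you would attach something like this to each internal queue (RPC, disk, replication) and correlate the samples with observed latency.

    # Illustrative instrumented queue: samples its length over time for later analysis.
    import queue, threading, time

    class MonitoredQueue(queue.Queue):
        def __init__(self, name, sample_interval=0.1):
            super().__init__()
            self.name = name
            self.samples = []                     # list of (timestamp, queue_length)
            t = threading.Thread(target=self._sample, args=(sample_interval,),
                                 daemon=True)
            t.start()

        def _sample(self, interval):
            while True:
                self.samples.append((time.time(), self.qsize()))
                time.sleep(interval)

    # Usage: replace an internal queue with MonitoredQueue("rpc-in"), run a workload,
    # then plot .samples to see whether sustained queue growth predicts latency spikes.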
7. The New PC Era -- Personal Clouds: In this project, you'll make it as easy to use the cloud as it is to run a Python script on your local machine. The PC project will have you build a system to enable the easy launch of Python (or other) scripts on a cloud service; problems you'll have to solve include how to securely launch a job, how to access data, and how to manage cloud resources within a given budget. This will be a fun building project, if you like that sort of thing.
8. Importance of Low-Overhead Communication/Storage: In this project, you'll study existing distributed systems and try to answer this question: how fast does the network really need to be? You'll base your work on the idea found in this paper and extend it to modern systems such as MongoDB, Cassandra, etc. How important is having a really high-performance network? The same question, it should be noted, can be asked of the storage layer.
9. Surveying Modern Hardware Failure: How does modern hardware fail? In this project, you'll survey the literature to create a model of how modern hardware fails, including memories, SSDs, disks, networks, and other hardware parts. You'll first read all relevant literature, and then try to capture in the simplest possible terms how faults arise in modern systems. The end goal would be to produce a journal paper that summarizes all previous work in this space.
10. SSD Fault Simulation: We read about how SSDs fail in class, and some theories of why the very unusual bathtub-shaped failure rates arise. In this project, you'll build a simulator of SSDs that includes localized failure behavior, and see if you can replicate the type of behavior seen in the SSD failure paper. What are the important parameters? How does such failure (and the remapping needed in the FTL) affect performance? Does the theory of SSD failure found in that paper match what you can produce via simulation?
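As a very rough starting point, here is a hedged sketch of how one might model per-block wear and localized early failure in an SSD simulator; the parameters (block count, wear limit, infant-mortality probability) are invented for illustration and would need to be fit against the failure data reported in the paper.

    # Toy SSD failure model: per-block infant mortality plus wear-out.
    # All parameters are illustrative assumptions, not measured values.
    import random

    NUM_BLOCKS = 1024
    INFANT_MORTALITY_P = 0.002     # chance a block is weak from the start
    WEAR_LIMIT = 3000              # erase cycles before a normal block wears out

    class SimSSD:
        def __init__(self):
            self.erases = [0] * NUM_BLOCKS
            self.weak = [random.random() < INFANT_MORTALITY_P
                         for _ in range(NUM_BLOCKS)]
            self.failed = set()

        def erase(self, block):
            if block in self.failed:
                return False
            self.erases[block] += 1
            limit = WEAR_LIMIT // 10 if self.weak[block] else WEAR_LIMIT
            if self.erases[block] >= limit:
                self.failed.add(block)     # in a real SSD, the FTL would remap it
                return False
            return True

Driving this with a write workload and plotting failures over time should show early failures from weak blocks followed by a later wear-out wave; the open question for the project is whether tuning such parameters can reproduce the bathtub curve from the paper, and what the remapping costs do to performance.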
11. Abort Conditions and Failure Handling in 2PC: In this work, you'll evaluate two-phase commit in live systems. First, find a system or two that use 2PC to perform distributed transactions (e.g., PostgreSQL, others?). Then start to trace through how it works, in particular focusing on when aborts arise and, in general, what failure cases are handled. What conditions cause an abort vote from one node? Fault injection could be useful here; perhaps you could insert disk failures (write() returns an error) and memory-allocation failures (malloc() returns NULL) on one node to see what happens during distributed transaction commit.
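For the fault-injection side, here is a minimal sketch of one way to inject disk-write failures when the component under test is written in Python or driven from a Python harness; the failure probability is an invented value, and for a C system such as PostgreSQL you would more likely use an LD_PRELOAD shim or a fault-injecting filesystem, so treat this only as an illustration of the idea.

    # Illustrative fault injector: wrap os.write so a fraction of writes fail with EIO.
    # This mimics injecting disk-write errors into one participant to see whether the
    # failure turns into an abort vote, a hang, or something worse.
    import errno, os, random

    FAIL_PROBABILITY = 0.05          # invented value; tune per experiment
    _real_write = os.write

    def faulty_write(fd, data):
        if random.random() < FAIL_PROBABILITY:
            raise OSError(errno.EIO, "injected I/O error")
        return _real_write(fd, data)

    os.write = faulty_write          # patch before driving the code under test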
12. Faults and Divergence in Replicated State Machines: One hard problem in RSMs is ensuring no divergence; even if replicated servers receive the same inputs, there is no guarantee they will behave the same, so careful programming is required. In this project, you will inject faults into some real replicated services and see if you can get the replicas to diverge (to make different decisions and thus lead to observably different behavior). What happens when a disk write fails? When a memory allocation fails? What if memory becomes corrupt? And so on.
13. Hyper-scale Simulation: It is incredibly challenging to run and test systems at scale. In this project, you'll address that problem by building a simulator that can mimic the behavior of various real distributed systems at scale. Start with a real system (e.g., the Google File System) and build pieces of it in your simulator, making it as detailed as possible. Then scale up disks, machines, clients, network, etc., and see how the system behaves. You would likely have to think about how to start inducing different types of failures to see really interesting behaviors (e.g., disk failures lead to background data migration, which can have a cost on foreground performance).
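As a hint of what such a simulator's core might look like, here is a minimal discrete-event loop sketch in Python; the event types and the failure/re-replication example are invented for illustration, and a real simulator would model disks, network links, and chunkservers in far more detail.

    # Minimal discrete-event simulator core (illustrative; not a real GFS model).
    import heapq

    class Sim:
        def __init__(self):
            self.now = 0.0
            self.events = []                   # heap of (time, seq, callback)
            self._seq = 0

        def schedule(self, delay, callback):
            heapq.heappush(self.events, (self.now + delay, self._seq, callback))
            self._seq += 1

        def run(self, until):
            while self.events and self.events[0][0] <= until:
                self.now, _, callback = heapq.heappop(self.events)
                callback()

    sim = Sim()

    def disk_failure():
        print(f"t={sim.now:.1f}: disk failed, scheduling background re-replication")
        sim.schedule(30.0, lambda: print(f"t={sim.now:.1f}: re-replication done"))

    sim.schedule(10.0, disk_failure)           # inject a failure at t=10
    sim.run(until=100.0)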
14. A Theory of Tail Latency and Predictable Local Storage Systems: Tail latency has become a central focus in the performance of large-scale systems. The question is not what the average response time is, but rather what the 99th percentile of requests will see. Clearly, the closer 99th-percentile behavior is to the average, the more predictable the behavior of your system is. In this project, you'll start by measuring latencies of local file systems (the key building blocks in distributed storage) to understand what kinds of latency profiles are common. What functionality in the file system can lead to different observed tail performance? (Think about reading/writing, caching with different-sized memories, path hierarchy depth, lots of files in a directory, and other functionality that could affect performance.) If you find some interesting problems, you could then take the next step and start to build a more predictable local storage system; how can you make it such that the local storage system is a highly predictable building block for larger-scale systems?
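As a first measurement, here is a hedged Python sketch that times many small synchronous file writes and reports median versus 99th-percentile latency; the file count, write size, and target directory are arbitrary choices for illustration, and a real study would vary them along the dimensions listed above.

    # Measure per-operation latency of small synchronous writes and report the tail.
    # File count, write size, and path are illustrative choices (Unix-style /tmp).
    import os, statistics, time

    N = 2000
    BUF = os.urandom(4096)

    latencies = []
    for i in range(N):
        path = f"/tmp/tail_test_{i}.dat"
        start = time.perf_counter()
        fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
        os.write(fd, BUF)
        os.fsync(fd)
        os.close(fd)
        latencies.append(time.perf_counter() - start)
        os.remove(path)

    latencies.sort()
    print("median: %.3f ms" % (1000 * statistics.median(latencies)))
    print("p99:    %.3f ms" % (1000 * latencies[int(0.99 * N)]))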
15. Stressing The Maintenance System: Scalable storage systems have to perform various amounts of maintenance to ensure that the data within them remains safe. For example, HDFS has to scan data in the background to ensure the right number of replicas are available, and then make more copies as need be when nodes are down. However, it is challenging to build such maintenance robustly; if too many nodes are down, the system may start replicating data at too high a rate (hurting performance) or not quickly enough (hurting availability). In this project, you'll take existing systems and build tools to stress their background maintenance activities, in order to understand how they work and what their limits are. If you have time, you can even improve existing systems by making them more robust to a range of failure behaviors.
16. The Science of Distributed Evaluation: Many papers produce results on how various distributed systems perform; how reproducible are those results? In this project, you'll take on the task of comparing the performance of modern key-value storage systems to see if what the papers say matches your own reality. We'll provide some papers to start with, and we'll proceed by evaluating the evaluations themselves.
