
Scalable Storage Systems

Abstract

Highly available cloud storage is often implemented with complex, multi-tiered distributed systems built on top of clusters of commodity servers and disk drives. Sophisticated management, load balancing, and recovery techniques are needed to achieve high performance and availability amidst an abundance of failure sources, including software, hardware, network connectivity, and power issues. While there is a relative wealth of failure studies of individual components of storage systems, such as disk drives, relatively little has been reported so far on the overall availability behavior of large cloud-based storage services. (1)

The ideal distributed file system would provide all its users with coherent, shared access to the same set of files, yet would be arbitrarily scalable to provide more storage space and higher performance to a growing user community. It would be highly available in spite of component failures. It would require minimal human administration, and administration would not become more complex as more components were added. (2)

Introduction

The new generation of applications requires processing of terabytes and even petabytes of data, and this is achieved by distributed processing. It is one of the major reasons for the power of web companies such as Google, Amazon, and Yahoo!. There are several reasons for distributed processing. On one hand, programs should be scalable and should take advantage of multiple systems as well as multicore CPU architectures. On the other hand, web servers have to be globally distributed for low latency and failover. Distributed processing implies distributed data, which is a very different beast from traditional relational database systems. Several researchers have suggested that this marks the end of an architectural era and that relational database vendors have to start over. The web has changed the requirements on storage and database systems for the next generation of applications, and several companies are struggling against traditional databases to meet those requirements. There are several lessons to be learned from the web, including simplicity, scalability, caching, flexibility in handling graphs, and support for simple, flexible queries. (3)
(1) Availability in Globally Distributed Storage Systems - Daniel Ford, Francois Labelle, Florentina I. Popovici, Murray Stokely, Van-Anh Truong, Luiz Barroso, Carrie Grimes, and Sean Quinlan, Google, Inc. - http://static.usenix.org/event/osdi10/tech/slides/boyd-wickizer.pdf
(2) Frangipani: A Scalable Distributed File System - Chandramohan A. Thekkath, Timothy Mann, Edward K. Lee, Systems Research Center, Digital Equipment Corporation - http://pdos.csail.mit.edu/6.824/papers/thekkath-frangipani.pdf
(3) http://www.swaroopch.org/notes/Distributed_Storage_Systems

Distributed file systems are designed for today's high-performance, virtualized cloud environments. Unlike traditional data centers, cloud environments require multi-tenancy along with the ability to grow or shrink resources on demand. Enterprises can scale capacity, performance, and availability on demand, with no vendor lock-in, across on-premise, public cloud, and hybrid environments. (1)

Figure 1-1: Common distributed file system architecture (2)

Assumptions

In designing a file system for our needs, we have been guided by assumptions that offer both challenges and opportunities. We alluded to some key observations earlier and now lay out our assumptions in more detail. The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis. The system stores a modest number of large files. We expect a few million files, each typically 100 MB or larger in size. Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but we need not optimize for them.
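The assumption that components fail routinely implies continuous self-monitoring. As a rough sketch (an illustration, not taken from any particular system), a master process can track heartbeats from storage servers and treat a server as failed, triggering re-replication, once its heartbeats stop arriving. The class name `HeartbeatMonitor` and the 10-second timeout are illustrative choices.

```python
import time

class HeartbeatMonitor:
    """Marks storage servers dead when their heartbeats stop arriving."""

    def __init__(self, timeout_seconds=10.0):
        self.timeout = timeout_seconds
        self.last_seen = {}   # server id -> timestamp of last heartbeat

    def record_heartbeat(self, server_id):
        self.last_seen[server_id] = time.monotonic()

    def dead_servers(self):
        now = time.monotonic()
        return [s for s, t in self.last_seen.items() if now - t > self.timeout]

# A recovery loop would periodically call dead_servers() and schedule
# re-replication of any data whose replica count has dropped.
monitor = HeartbeatMonitor(timeout_seconds=10.0)
monitor.record_heartbeat("chunkserver-42")
print(monitor.dead_servers())   # [] while heartbeats are fresh
```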
(1) Gluster File System 3.2.5 Administration Guide - Introducing Gluster File System
(2) http://wikibon.org/wiki/v/And_the_world_spawns_another_file_system

The workloads primarily consist of two kinds of reads: large streaming reads and small random reads. In large streaming reads, individual operations typically read hundreds of KBs, more commonly 1 MB or more. Successive operations from the same client often read through a contiguous region of a file. A small random read typically reads a few KBs at some arbitrary offset. Performance-conscious applications often batch and sort their small reads to advance steadily through the file rather than go back and forth.

The workloads also have many large, sequential writes that append data to files. Typical operation sizes are similar to those for reads. Once written, files are seldom modified again. Small writes at arbitrary positions in a file are supported but do not have to be efficient.

The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. Our files are often used as producer-consumer queues or for many-way merging. Hundreds of producers, running one per machine, will concurrently append to a file. Atomicity with minimal synchronization overhead is essential. The file may be read later, or a consumer may be reading through the file simultaneously.

High sustained bandwidth is more important than low latency. Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response time requirements for an individual read or write.
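As an illustration of the batch-and-sort pattern mentioned above, the following sketch sorts a set of small random reads by offset so the application advances through the file steadily instead of seeking back and forth. The `read_at` helper is hypothetical and stands in for whatever positioned-read call the underlying file system client actually exposes.

```python
def read_at(f, offset, length):
    # Hypothetical positioned read; a real client library would issue this
    # against the distributed file system rather than a local file object.
    f.seek(offset)
    return f.read(length)

def batched_small_reads(path, requests):
    """Serve many small random reads (offset, length) in ascending offset order.

    Sorting the requests turns a back-and-forth seek pattern into a single
    steady forward pass through the file, matching the workload assumption
    that performance-conscious applications batch and sort their small reads.
    """
    results = {}
    with open(path, "rb") as f:
        for offset, length in sorted(requests):
            results[(offset, length)] = read_at(f, offset, length)
    return results

# Example: three 4 KB reads issued out of order are served front-to-back.
# reads = [(1_048_576, 4096), (0, 4096), (65_536, 4096)]
# data = batched_small_reads("/data/input.log", reads)
```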

Previous Work:
Gluster File System is an open-source file system. The software is a powerful and flexible solution that simplifies the task of managing unstructured file data, whether you have a few terabytes of storage or multiple petabytes. Gluster Storage Platform integrates the file system, an operating system layer, and a web-based management interface and installer. Installation is a simple process that enables customers to deploy a few hundred terabytes of clustered storage in two steps and just a few mouse clicks. Gluster Storage Platform runs on industry-standard hardware from any vendor and delivers multiple times the scalability and performance of conventional storage at a fraction of the cost. Gluster Storage Platform clusters together storage building blocks, aggregating disk and memory resources and managing data in a single global namespace. It employs a stackable file system architecture that can be optimized for specific application profiles with simple plug-in modules, tuning performance for a wide range of workloads.
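To make the stackable architecture concrete, here is a minimal sketch (an illustration, not Gluster code) of how plug-in layers can be chained so that each one wraps the next: reads and writes pass through an optional caching layer before reaching storage. The class and method names are hypothetical.

```python
class Storage:
    """Bottom of the stack: keeps file contents in memory for illustration."""
    def __init__(self):
        self.files = {}

    def write(self, path, data):
        self.files[path] = data

    def read(self, path):
        return self.files[path]


class WriteThroughCache:
    """Plug-in layer: caches reads while forwarding every call to the next layer."""
    def __init__(self, next_layer):
        self.next = next_layer
        self.cache = {}

    def write(self, path, data):
        self.cache[path] = data
        self.next.write(path, data)

    def read(self, path):
        if path not in self.cache:
            self.cache[path] = self.next.read(path)
        return self.cache[path]


# Stack the layers: the application talks to the top; each layer wraps the next.
fs = WriteThroughCache(Storage())
fs.write("/logs/a.txt", b"hello")
assert fs.read("/logs/a.txt") == b"hello"
```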

Gluster Advantages:
Scalability
- Truly linear scalability makes it easy to expand as needed
- Scales to hundreds of petabytes
- Add new nodes with two mouse clicks

Performance
- Unique architecture eliminates the need for a metadata server and delivers ground-breaking IOPS and throughput
- No metadata server, so no central point of failure
- Automatic load balancing
- Stripe files across dozens of storage blocks

Kernel Independent
- Gain the benefits of kernel independence without any of the performance tradeoffs
- No complex cross-platform compatibility issues
- No problematic upgrades
- Easy implementation of custom applications and modules

High Availability
- Options include data mirroring and real-time self-healing
- Error detection and correction even while files are in use or during hardware failures

Easy Deployment & Management
- Deploy petabyte-scale storage in just 15 minutes; ongoing maintenance is simple and efficient with integrated management of volumes, data resources, and servers
- Self-healing (no fsck) with automatic error correction
- Wizard-based deployments
- Centralized logging and reporting
- No system downtime for upgrades or new nodes

Optimized for Virtual Servers
- True virtual storage for server virtualization
- Store VM images and data in a single namespace
- Multiple copies of VM images provide always-on performance even during hardware failure

Quick, Flexible Setup
- Push-button setup lets you customize storage settings for your workload; easily optimize for small files, cloud computing, or any other usage
- Native clients available but not required
- Supports NFS, CIFS, HTTP, WebDAV, and FTP protocols
- Network: Gigabit Ethernet, 10 Gigabit Ethernet, InfiniBand

Flexible Modular Design
- Gluster's modular architecture enables easy addition of features without the bloat of features you don't require
- Easy, fast feature enhancements
- Design Gluster for your workload and eliminate unneeded features for optimum speed
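The "no metadata server" claim rests on clients being able to compute where a file lives rather than asking a central directory. The sketch below shows the general idea behind that kind of elastic, hash-based placement; it is an illustration under simplifying assumptions (a fixed list of storage nodes, placement by hashing the file path), not Gluster's actual algorithm, and the brick names are hypothetical.

```python
import hashlib

# Hypothetical storage nodes ("bricks"); in a real deployment this list would
# come from the volume configuration, not be hard-coded.
BRICKS = ["server1:/export/brick1", "server2:/export/brick1", "server3:/export/brick1"]

def brick_for(path, bricks=BRICKS):
    """Pick the brick that stores `path` by hashing the path itself.

    Because every client runs the same deterministic computation, no central
    metadata server has to be consulted, so there is no single point of
    failure or lookup bottleneck on the metadata path.
    """
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    return bricks[int(digest, 16) % len(bricks)]

# Example: every client resolves this path to the same brick.
print(brick_for("/videos/lecture-01.mp4"))
```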
