
Introducing XtremIO

Hardware Overview
The building block of XtremIO is an X-Brick.

XtremIO X-Brick
An X-Brick is composed of two storage controllers, a DAE holding 25 SSDs, and two
battery backup units. Each X-Brick at GA will utilize 25x 400GB eMLC SSDs and provide
10TB of raw capacity.
The array scales by adding X-Bricks in a scale-out manner (max of 4 X-Bricks at GA):

X-Brick Scaleout

The interconnect between all X-Bricks is 40Gbps InfiniBand (similar to Isilon, except
Isilon uses 10Gbps InfiniBand). If we think of Isilon as scale-out NAS, XtremIO
represents the scale-out all-flash block category.

XtremIO Node
Connectivity Ports and Node details:

The 40Gbps IB ports are for the back-end connectivity.

For front-end host connectivity, XtremIO supports both 8Gb Fibre Channel and
10Gb iSCSI.

Connectivity to the disks is via 6Gbps SAS connections (very similar to a VNX).

Each node contains 2x SSDs to serve as a dump area for metadata if the node
were to lose power.

Each node also contains 2x SAS drives to house the operating system. In this
way, the disks in the DAE themselves are decoupled from the controllers since
they only hold data, and this should facilitate easy controller upgrades in the
future when better/faster hardware becomes available.

At GA, each node is essentially a dual-socket 1U whitebox server utilizing 2x
Intel 8-core Sandy Bridge CPUs and 256GB of RAM.

XtremIO Scaling Performance

Software Features

Inline Deduplication. Part of the system architecture and lowers effective cost
while increasing performance and reliability by reducing write amplification.

Thin Provisioning. Also part of the system architecture in how writes are managed,
and carries no performance penalty.

Snapshots. No capacity or performance penalty due to the data management and
snapshot architecture.

XDP Data Protection. RAID6-like protection designed for all-flash arrays. Low overhead with
better-than-RAID1 performance. No hot spares.

Full VAAI Integration.

Architecture
XtremIO Software Architecture



Under the covers, the system runs on top of a standard Linux kernel, and the XtremIO
software, XIOS, executes 100% in userspace. Running entirely in userspace avoids
expensive context-switching operations, provides for ease of development, and does not
require the code to be published under the GPL. An XIOS implemented in the Linux kernel
would have to be published under the GPL, which poses a problem for a company
like EMC that wants to protect its IP.
An XIOS instance called the X-ENV runs on each CPU socket. The CPU and memory
are monopolized by XIOS; running the top command in Linux reveals a single
process per CPU socket consuming 100% of the resources. This allows XIOS to
manage the hardware resources directly and provide predictable, guaranteed
performance, leaving nothing to chance (i.e., no outside process or kernel
scheduler can impact the environment unbeknownst to XIOS). An
interesting side effect of a software architecture that lives 100% in userspace
is that it COULD allow a move from Linux to another OS, or from x86 to another CPU, if
required. This isn't a likely scenario without some serious mitigating factor, but it
would be possible thanks to the design.
The first thing to note about the architecture is that it is software defined, in the sense that it
is independent of the hardware itself. This is evidenced by the fact that it took the
XtremIO team a very short period of time to transition from their pre-acquisition hardware
to EMC's whitebox standard hardware. Things which are NOT in the X-Brick include:
FPGAs, custom ASICs, custom flash modules, custom firmware, and so on. This will
allow the XtremIO team to take advantage of any x86 hardware enhancements, including
speeds/feeds improvements, density improvements, new interconnect
technologies, etc., without much hassle. XtremIO really is a software product delivered in
an appliance form factor. While there is nothing preventing XtremIO from delivering a
software-only product, it would be encumbered with the same challenges all the other
software-only storage distributions on the market face, namely the difficulty of
guaranteeing predictable performance and reliability when unknown hardware is utilized.
That being said, if enough customers demand it, who knows what could happen; but
today, XtremIO is delivered as HW + SW + EMC Support.
There are 6 software modules responsible for various functions in the system. The first 3
(R,C,D) are data plane modules and the last 3 (P,M,L) are control plane modules.
P - Platform Module. This module is responsible for monitoring the hardware of the
system. Each node runs a P-module.

M - Management Module. This module is responsible for system-wide configuration. It
communicates with the XMS management server to perform actions such as volume
creation, host LUN masking, etc. from the GUI and CLI. There is one active M-module
running on a single node, and the other nodes run a standby M-module for HA
purposes.

L - Clustering Module. This module is responsible for managing cluster
membership state, joining the cluster, and typical cluster functions. Each node runs an L-module.

R - Routing Module. This module is the SCSI command parser and translates all host
SCSI commands into internal XtremIO commands/addresses. It is responsible for the 2
FC and 2 iSCSI ports on the node and functions as the ingress/egress point for all I/O of
the node. It is also responsible for breaking all I/O into 4K chunks and calculating the
data hash values via SHA-1 (a small sketch of this chunk-and-hash step follows the module list). Each node runs an R-module.

C - Control Module. This module contains the address-to-hash mapping table (A2H),
which is the first layer of indirection that allows much of the magic to happen. Many of
the advanced data services, such as snapshots, de-duplication, and thin provisioning, are
handled in this module.

D - Data Module. The data module contains the hash-to-physical (H2P) SSD address
mapping. It is also responsible for doing all of the I/O to the SSDs themselves as well as
managing the data protection scheme, called XDP (XtremIO Data Protection).
The function of these modules along with the mapping tables will be clearer after
reviewing how I/O flows through the system.
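
Before walking through the flows, here is a minimal Python sketch of the R-module's chunk-and-hash step described above; the 4KB chunk size and SHA-1 come from the text, while the function and variable names are purely illustrative:

```python
import hashlib

CHUNK_SIZE = 4096  # XtremIO operates on fixed 4KB chunks

def chunk_and_fingerprint(buffer: bytes):
    """Split a buffer into 4KB chunks and compute a SHA-1 content
    fingerprint per chunk, as the R-module is described to do before
    handing work to a C-module."""
    for offset in range(0, len(buffer), CHUNK_SIZE):
        chunk = buffer[offset:offset + CHUNK_SIZE]
        yield offset, hashlib.sha1(chunk).hexdigest()

# Two identical chunks produce the same fingerprint, which is what
# makes inline deduplication possible further down the stack.
data = b"A" * 4096 + b"A" * 4096 + b"B" * 4096
for offset, fp in chunk_and_fingerprint(data):
    print(offset, fp[:12])
```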
Read I/O flow

XtremIO Read I/O Flow


Stepping through this diagram, a host first issues a read command for a logical block
address via the FC or iSCSI ports. It is received by the R-module, which breaks the
requested address range into 4KB chunks and passes them along to the C-module. To
read the 4K chunk at address 3 (address 3 is just an example 4KB address),
the C-module does a lookup and sees that the hash value for the data is H4. It then
passes this to the D-module, which looks up the hash value H4 in its hash-to-physical
address lookup table and reads physical address D from the SSD.
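
A minimal sketch of this two-step lookup, with the A2H and H2P tables modelled as plain dictionaries (the addresses and hash values mirror the example above; the real data structures are far more elaborate, as noted later):

```python
# Simplified read path: logical address -> hash (C-module, A2H table),
# then hash -> physical SSD location (D-module, H2P table).
a2h = {3: "H4"}              # address-to-hash table held by the C-module
h2p = {"H4": "D"}            # hash-to-physical table held by the D-module
ssd = {"D": b"\x00" * 4096}  # stand-in for the 4KB block stored on SSD

def read_4k(address):
    digest = a2h[address]    # C-module lookup
    physical = h2p[digest]   # D-module lookup
    return ssd[physical]     # actual SSD read

assert read_4k(3) == b"\x00" * 4096
```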
Write I/O of unique data:

XtremIO Write Unique


This is an example flow for writing a unique 4KB data segment to the array. The host
issues a write I/O via FC or iSCSI. This is picked up by the R-module, which breaks up
the I/O into 4KB chunks and calculates the hash for each chunk. For the
purposes of this illustration we follow just a single 4KB chunk through the system.
The R-module hashes this 4KB of data, produces a hash value
of H5, and passes it to the C-module. The hash H5 is unique data, so
the C-module places it in its address mapping table at address 1. It then passes the
I/O to the D-module, which assigns H5 the physical address D and writes the 4KB of data
to the SSD at that physical address.
Write I/O of Duplicate Data:

XtremIO Duplicate Write


As with the previous write example, the host issues a write via FC or iSCSI. This is
picked up by the R-module, which breaks the I/O into 4KB chunks and calculates the
hash of each chunk. As before, we follow a single 4KB chunk through the system for
the sake of simplicity. In this case the R-module
calculates the hash of the 4KB chunk to be H2 and passes it to the C-module. The C-module sees that this data already exists at address 4 and passes this to the D-module.
Since the data already exists, the D-module simply increments the reference count for
this 4KB of data from 1 to 2. No I/O is done to the SSD.
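
Both write cases can be condensed into one short sketch: a unique hash gets a new physical slot and one SSD write, while a duplicate hash only bumps a reference count. The table layout and names here are illustrative only:

```python
import hashlib

a2h = {}          # logical address -> content hash (C-module)
h2p = {}          # content hash -> (physical slot, ref count) (D-module)
ssd = {}          # stand-in for physical 4KB slots
_next_slot = 0

def write_4k(address, chunk):
    """Write one 4KB chunk: dedupe on the content hash, so duplicate
    data never generates back-end SSD I/O."""
    global _next_slot
    digest = hashlib.sha1(chunk).hexdigest()
    a2h[address] = digest
    if digest in h2p:                     # duplicate: metadata-only update
        physical, refs = h2p[digest]
        h2p[digest] = (physical, refs + 1)
    else:                                 # unique: allocate a slot and write
        physical = _next_slot
        _next_slot += 1
        ssd[physical] = chunk
        h2p[digest] = (physical, 1)

write_4k(1, b"X" * 4096)   # unique chunk -> one back-end write
write_4k(4, b"X" * 4096)   # duplicate -> ref count goes 1 -> 2, no SSD I/O
print(len(ssd))            # 1: only one physical copy is ever stored
```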
We can see from the write I/O flows that thin provisioning and de-duplication aren't
really features, but rather a byproduct of the system architecture, because only unique
4KB segments are ever written as a design principle. The system was truly designed
from the ground up with data reduction for SSDs in mind. The de-duplication happens
inline, 100% in memory, with zero back-end I/O. There is no turning off de-duplication
(since it's a function of how the system does writes), and it carries no penalties; in
fact it boosts performance by reducing back-end write amplification.
VM Copy
How about a simple use case such as copying a VM?

Metadata before copy

Metadata after VM Copy


Stepping through the copy operation, the ESXi host issues a VM copy utilizing VAAI. The
R-module receives the command via the FC or iSCSI ports and selects a C-module to
perform the copy. Address range 0-6 represents the VM. The C-module recognizes the
copy operation and simply does a metadata copy of address range 0-6 (the original VM)
to a new address range 7-D (the new VM), then passes this along to the
D-module. The D-module recognizes that the hashes are duplicates and simply
increments the reference counts for each hash in the table. No back-end SSD I/O is required.
In this way, the new VM (represented by address range 7-D) references the same 4K
blocks as the old VM (represented by address range 0-6).
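
Continuing the same toy model, a VAAI copy reduces to duplicating A2H entries into a new address range and incrementing reference counts; the 0-6 source range comes from the figure, everything else is illustrative:

```python
# Toy metadata tables for a VM occupying logical addresses 0..6.
a2h = {addr: f"H{addr}" for addr in range(7)}          # address -> hash
h2p = {f"H{addr}": (addr, 1) for addr in range(7)}     # hash -> (slot, refs)

def vaai_copy(src_start, src_end, dst_start):
    """Metadata-only copy: duplicate A2H entries for the new range and
    increment the ref count of each hash. No data is read or written."""
    for i, src in enumerate(range(src_start, src_end + 1)):
        digest = a2h[src]
        a2h[dst_start + i] = digest
        slot, refs = h2p[digest]
        h2p[digest] = (slot, refs + 1)

vaai_copy(0, 6, 7)                                     # "copy" the VM to 7..13
print(all(refs == 2 for _, refs in h2p.values()))      # True, zero SSD I/O
```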
The key thing to note throughout these I/O flows is that all metadata operations are done
in memory. To protect the metadata, there is a sophisticated journaling mechanism
that uses RDMA to transfer metadata changes to remote controller nodes and hardens the
journal updates to SSDs in the drive shelves using XDP.
The magic behind XtremIO is all in the metadata management and manipulation. It's
also worth noting that the data structures used for the A2H and H2P tables are much
more complicated than depicted above; they have been simplified in the illustrations for
the purpose of understanding the I/O flows.

The second thing to note is that the D-module is free to write the data anywhere it sees
fit, since there is no coupling between a host disk address and a back-end SSD address,
thanks to the A2H and H2P tables. This is further optimized because the content of the data
becomes the address for lookups: ultimately the hash value is what determines
physical disk location via the H2P table. This gives XtremIO tremendous flexibility in
performing data management and optimizing writes for SSD.
Module Communications
With an understanding of the relationship between the R, C, and D modules and their
functions, the next thing to look at is how exactly they communicate with each other.

XtremIO Module Communication


The first thing to understand is how the modules are laid out on the system. As
discussed previously, each node has 2 CPU sockets and an XIOS instance runs on each
socket in usermode. We can see from the above that each node is configured very
specifically, with R and C running on one socket and D running on the other. The
reason has to do with the Intel Sandy Bridge architecture, which has an
integrated PCIe controller tying every PCIe adapter directly to a CPU socket. Thus, on a
system with multiple CPU sockets, performance is better when utilizing the local
CPU socket to which the PCIe adapter is connected. The R, C, D module distribution was
chosen to optimize the configuration based on field testing. For example, the SAS card
is connected to a PCIe slot attached to CPU socket 2, so the D-module
runs on socket 2 to optimize SSD I/O performance. This is a great example of
how, while a software storage stack like XtremIO is hardware independent and could
be delivered as a software-only product, there are optimizations for the underlying
hardware which must be taken into consideration. The value of understanding the
underlying hardware applies not only to XtremIO but to all storage stacks. These are the
types of things you do NOT want to leave to chance or to an end user's decisions.
Never confuse hardware independence with hardware knowledge and optimization;
there is great value in the latter. The great thing about the XIOS architecture is that, since
it is hardware independent and modular, XIOS can easily take advantage of improvements
as the hardware architecture evolves.
Moving on to the communication mechanism between the modules, we can see that no
preference is given to module locality. When the R-module selects a C-module, it does not
prefer the C-module local to itself. All communications between the
modules are done via RDMA or RPC (depending on whether it is a data path or control path
communication) over InfiniBand. The total I/O budget in an XtremIO system is 600-700
microseconds, and the overhead of the InfiniBand communication is 7-16 microseconds.
The result of this design is that as the system scales, the latency does NOT increase.
Whether there is 1 X-Brick or 4 X-Bricks or more in the future, the latency for I/O remains
the same since the communication path is identical. The C-module selection by the R-module
is done using the same calculated data hashes, which ensures a completely random
distribution of module selection across the system, and this is done for each 4K block.
For example, if there are 8 controllers in the cluster with 8x R, C, D modules, communication
happens between all of them evenly. In this way, every corner of the
XtremIO box is exercised evenly and uniformly with no hot spots. Everything is very
linear, deterministic, and predictable. If a node fails, the performance degradation can be
predicted, the same as the performance gain when adding node(s) to the system.
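
The idea that the content hash itself picks the C-module can be sketched as follows; the modulo scheme is an assumption for illustration, since the text only states that selection is derived from the calculated data hashes:

```python
import collections
import hashlib
import os

def select_c_module(chunk: bytes, num_c_modules: int) -> int:
    """Derive a C-module index from the chunk's content hash so that
    4KB chunks scatter uniformly across all controllers in the cluster.
    (The modulo scheme is an illustrative assumption.)"""
    digest = hashlib.sha1(chunk).digest()
    return int.from_bytes(digest[:4], "big") % num_c_modules

# With 8 controllers (8 C-modules), random 4KB chunks land on all of
# them roughly evenly, so there are no hot spots and latency stays flat
# as the cluster grows, because the selection logic never changes.
counts = collections.Counter(
    select_c_module(os.urandom(4096), 8) for _ in range(8000)
)
print(sorted(counts.values()))   # eight counts, each close to 1000
```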
XDP (XtremIO Data Protection)
A critical component of the XtremIO system is how it does data protection. RAID5?
RAID6? RAID 10? None of the above. It uses a data protection scheme called XDP,
which can broadly be thought of as RAID6 for all-flash arrays: it provides
double parity protection but without any of the penalties associated with typical RAID6.

The issue with traditional RAID6 applied to SSDs is that as random I/O comes into the
array and forces updates/overwrites, the 4K block(s) need to be updated in place on the
RAID stripe, and this causes massive amounts of write amplification. This is exactly the
situation we want to avoid. For example, in a RAID6 stripe, if we want to update a single
4K block we have to read that 4K block plus two 4K parity blocks (3 reads), then
calculate new parity and write the new 4K block and two new 4K parity blocks (3 writes).
Hence, for every 1 front-end write I/O we have 3 back-end write I/Os, giving us a
write amplification of 300%, or said another way, a 3x overhead per front-end write. The
solution to this problem is to never do in-place updates of 4K blocks, and this is the
foundation of XDP. Because there is an additional layer of indirection via the A2H and
H2P tables, XtremIO has complete freedom (within reason) over where to place the
physical block, even though the application is updating the same address. If an application
updates the same address with different 4K content, a new hash is calculated and
the 4K block is therefore placed in a different location. In this way, XtremIO avoids any
update-in-place operations. This is the power of content-aware addressing, where the
data is the address. It should also be noted that being able to write data anywhere is
not enough by itself; it is this ability coupled with flash that makes the architecture feasible,
since flash is a random-access medium with no latency penalty for random I/O, unlike an
HDD with physical heads. The previously described process is illustrated below.

XtremIO XDP Before Update

XtremIO XDP After Update


The basic principle of XDP is that it follows the above write I/O flow and then waits for
multiple writes to come into the system, bundling those writes together and writing a full
stripe to the SSDs, thus amortizing the cost of a 4K update over multiple updates to gain
efficiency and lower the write overhead. The I/O flow is exactly the same as the Write
I/O of Unique Data illustrated in the previous section, except that XDP simply waits for
multiple I/Os to arrive in order to amortize the write overhead cost. One thing to
note is that the example uses a 2+2 stripe. In practice, the stripe size is dynamic
and XtremIO looks for the emptiest stripe when writing data. The 23+2 stripes will run
out quickly, due to the holes created by old blocks (denoted by the white space in the
figure; these holes will be reclaimed by XDP to facilitate large stripe writes, though the
math behind this is complex and beyond the scope of this article). However, even if only
a 10+2 stripe is found and used, the write amplification/overhead is reduced from 300%
(3 back-end writes for every 1 front-end write) to 20% (12 back-end writes for every 10
front-end writes), and this is what XtremIO conservatively advertises as overhead:

XDP vs RAID Overhead


However, in practice, even on an 80% full system, it is likely that stripe sizes much larger
than 10+2 will be found, leading to even less than a 20% write overhead. Even at 20%,
the write overhead is not only less than RAID6 but also much less than RAID1, all
while providing dual parity protection. Better than RAID1 performance with the
protection of RAID6: that, in a nutshell, sums up XDP.
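
The overhead figures quoted above follow directly from counting back-end writes per front-end write, as this small sketch of the arithmetic shows (stripe widths other than 10+2 are included purely for illustration):

```python
def backend_writes_per_frontend(data_blocks: int, parity_blocks: int = 2) -> float:
    """Back-end writes divided by front-end writes when `data_blocks`
    new 4K blocks are written together with `parity_blocks` parity blocks."""
    return (data_blocks + parity_blocks) / data_blocks

# A classic RAID6 in-place update of one 4K block behaves like a 1+2 case:
print(backend_writes_per_frontend(1))    # 3.0  -> the 3x (300%) figure
# XDP bundling a 10+2 stripe (the conservatively advertised case):
print(backend_writes_per_frontend(10))   # 1.2  -> 20% write overhead
# A wider 23+2 stripe, when a nearly empty stripe is available:
print(backend_writes_per_frontend(23))   # ~1.09 -> under 10% overhead
```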

EMC XtremIO GUI Overview and Storage Provisioning
Let's take an overview of the XMS GUI:

Log in to the XMS server using its IP address.

We can see the dashboard below:

Configuration Tab:

Hardware Tab:

Event Tab:

Monitor Tab:

Administration Tab:

Storage Provisioning:
Note: Before provisioning storage, the host first needs to be zoned
with the storage array.
Steps involved in storage provisioning:
1. Creating an initiator folder and adding members to the folder.
2. Creating a storage folder.
3. Adding volumes to the storage folder.
4. Updating the masking view.


Storage provisioning is a lot easier in the XtremIO GUI. Let's go
through the steps.
Step 1:
Go to the Configuration tab.

Step 2:
Click on Add Volume (highlighted in red).

Step 3:
Click on Add Multiple (highlighted in red).

Step 4:
Specify volume name and size.

Step 5:
Create a new folder where the previously created volumes will be kept.

Step 6:
We can see the paras folder and the Paras_xio_01 volumes created.

Step 7:
To create an initiator group, click on Add and select the PWWN.

Step 8:
Specify the parent folder for the initiator group.

Step 9:
Click on the volumes and the initiator group to create a masking view.

Step 10:
Click on Map All and then click Apply. The storage is now visible to the host.
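
For completeness, provisioning like this could in principle also be scripted against the XMS REST interface rather than the GUI. The sketch below is hypothetical: the endpoint path, field names, address, and credentials are assumptions and should be checked against the XMS REST API guide for your XIOS version:

```python
# A hypothetical sketch only: volume creation driven from a script instead
# of the GUI. The endpoint path and JSON field names below are assumptions
# about the XMS REST interface, not documented values.
import requests

XMS = "https://xms.example.local"          # hypothetical XMS address
AUTH = ("admin", "password")               # placeholder credentials

def create_volume(name: str, size: str):
    """Create a volume via the XMS REST interface (assumed endpoint)."""
    resp = requests.post(
        f"{XMS}/api/json/types/volumes",            # assumed endpoint path
        json={"vol-name": name, "vol-size": size},  # assumed field names
        auth=AUTH,
        verify=False,                               # lab-only: skip TLS checks
    )
    resp.raise_for_status()
    return resp.json()

create_volume("Paras_xio_01", "100g")
```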

The above storage was assigned to an ESXi host. From the vSphere client we
have to scan for devices. Let's see the steps for scanning for new
devices.
Step 1:
Click on Storage and select Devices to view the list of devices connected
to the ESXi host.

Step 2:
Click on Rescan to identify new devices connected to the ESXi
host.
