
OceanStor Dorado

Born for Mission Critical Business

Storage CTO
The Cutting Edge of Storage Innovation

2
Primary Storage Leader in Gartner Magic Quadrant
2018 MQ for General-Purpose Arrays 2019 MQ for Primary Storage 2018 MQ for Solid-State Arrays

“Huawei has progressively become one of the leading providers of primary storage on the global stage.”
“Its external enterprise storage portfolio for primary storage workloads - OceanStor - spans all market segments”
“Huawei announced new versions of OceanStor Dorado6000 V3 and Dorado18000 V3 that support internal NVMe SSD ”
“Huawei’s SmartVirtualization plus SmartMigration software enables users to nondisruptively migrate data from competitive external enterprise storage systems
to OceanStor, or to migrate from an older OceanStor platform to a new OceanStor platform. ”
— Gartner

3
The Cutting Edge of Storage Innovation

4
OceanStor Dorado Product Portfolio

Entry-Level Mid-Range High-End

Model 3000 5000 6000 8000 18000

Height / Controllers of Each Engine 2U/2C 2U/2C 2U/2C 4U/4C 4U/4C

Controller Expansion 2-16 2-16 2-16 2-16 2-32

Cores in Each Controller 24 64 96 128 192

Maximum Disks 1200 1600 2400 3200 6400

Cache/Dual Controller 192G 256G/512G 512G/1024G 512G/1024G/2048G 512G/1024G/2048G

Front-end ports 8/16/32G FC, 1/10/25/40/100G Ethernet

Back-end ports SAS 3.0 SAS 3.0/100G Ethernet

5
FlashLink® - The Foundation of Evolution

Multi-Protocol Network Chip: Hi1822
• Supports both FC and Ethernet

AI Chip: Ascend 310
• AI SoC for small-scale training
• Meets the processing power requirement of > 1 TeraFLOPS for real-time analytics
• FP16: 8 TeraFLOPS; INT8: 16 TeraOPS; max power: 8 W
• Real-time analytics: data correlation, data similarity, adaptive optimization, health analytics, data temperature, failure prediction
• Use cases: intelligent cache, smart QoS, intelligent data dedup, ...

BMC Chip: Hi1710
• Troubleshooting accuracy of 93%

Array Controller Chip: Kunpeng 920
• SPECint 930+, the #1 performance ARM processor
• Enables the processor-embedded intelligent disk enclosure

SSD Controller Chip: Hi1812e
• Half the latency of the previous model

6
Kunpeng® CPU - The Heart of New Storage

[Diagram: 48-core Kunpeng CPU; each core drives its own NVMe submission/completion queue pair.]

7
SmartMatrix - Symmetric A/A Controller Architecture
[Diagram: two engines, each with shared frontends, storage controllers, and shared backends, fully meshed over an RDMA network to intelligent DAEs with their own DAE controllers.]

• Symmetric active/active controllers with a fully meshed topology
• Shared-everything architecture from the frontend and backend to the drive enclosure
• Persistent cache mirroring with a maximum of 3 copies
• Non-disruptive firmware upgrade; I/O hang-up time is limited to within 1 second
• End-to-end NVMe support
• Backend RDMA network over 100 Gb/s Ethernet
• SCM support for read acceleration*

8
OceanStor Dorado - New Gen of Mission Critical Storage

OceanStor Dorado (2017) vs OceanStor Dorado (2019)

Max. Performance 7M IOPS 20M IOPS

Max. Storage Controller 16 32

NVMe Support Back-End End-to-End

Backend Network SAS/PCIe 100Gb RoCE v2

SSD Form Factor 25 Drive/2U Shelf 36 Drive/2U Shelf

SSD Shelf Standard DAE (No CPU) Intelligent DAE

Data Deduplication Fixed-Length Fixed & Variable Length

Controller Fault Tolerance 1 of 2 7 of 8

Engine Fault Tolerance N/A 1 of 2

Artificial Intelligence (AI) N/A AI Module with Ascend Chip


9
Commitment to Business Continuity

10
Every Second is Valuable

Timeout budget along the I/O stack:
• Application: 1XX seconds
• Operating system: XX seconds
• Host bus adapter (HBA): XX seconds
• Network: 1X seconds
• Storage: X seconds

Each second of timeout for mission-critical business can mean:
• Tens of percent of transactions lost
• Thousands of dollars of profit lost
• Tens of thousands of unsatisfied customers (especially on Black Friday)

Some large FSI enterprises require storage to keep the timeout to 1 SECOND, e.g.:
• Industrial and Commercial Bank of China
• Itaú Unibanco
• China Construction Bank
• Agricultural Bank of China
• ......

11
Time is Money

Profit/Hour of the Top 10 Banks

[Bar chart: hourly profit of the top 10 banks, ranging from about $913,242 to $5,000,000 per hour.]

12
One-Second Controller Failover

[Charts: host IOPS during a controller failure. Competing solutions A and B need from 4 seconds up to more than 9 seconds to recover, while OceanStor Dorado recovers within 1 second.]

* The figures above refer to test results from Huawei labs.

13
One-Second’s Magic - Shared Frontend Adapter
[Diagram: the engine's shared frontend adapters connect to the server over an FC/Ethernet network; behind the backplane, I/O can reach any of the four storage controllers.]

• The frontend adapter holds the connection with the server independently; the storage controller is not involved.
• Normally, each I/O is directed to one storage controller through the backplane.
• If a controller fails, its I/O is redirected to the surviving controllers while the connection between the frontend adapter and the server is kept alive, so the server is not aware of the failure.

14
Multiple Controller Fault Tolerance

Legend: FE = front-end adapter, Ctrl = controller, BE = back-end adapter, shaded block = data copy in cache.

[Diagram: engines with shared FEs, controllers, and BEs attached to shared DAEs, shown under three fault scenarios: any 2 of 8 controllers failing simultaneously, 1 of 2 engines failing, and 7 of 8 controllers failing one after another.]

15
Best-of-Breed Reliability & Availability
Solution A
• The frontend adapter cannot be shared between controllers in one engine.
• A LUN has to be owned by a single controller.
• The SSD enclosure can be shared by all controllers in one engine.

Solution B
• The frontend adapter can be shared by all controllers in one engine (DKC).
• The SSD enclosure (DKU) can be shared by all controllers in one engine.

OceanStor Dorado
• The frontend adapter can be shared by all controllers in one engine.
• LUN ownership is eliminated.
• The SSD enclosure can be shared by controllers in multiple engines.

16
Firmware Non-Disruptive Upgrade (NDU)

Storage firmware in user space (94% of firmware components: service, data, protocol, control, management, and inter-communication modules)
• Modular design
• Online upgrade
• One second to activate
• No connection loss with the server
• Transparent to applications

Kernel (6% of firmware components)
• Rolling upgrade

17
Intelligent DAE - The SSD Shelf with Processing Power

[Diagram: the array engine (FE, controllers, BE) offloads data compression, erasure coding, and data rebuilding to the intelligent DAE controllers.]

Storage Controller Offloading
• Each DAE has two controllers, and each DAE controller has its own processor, cache, and adapter.
• The DAE controller takes over some workloads from the array controller, including:
  - Data rebuilding
  - Erasure coding (EC)*
  - Data compression*

With the help of the DAE controller, the DAE is much more intelligent than before. This distributed computing design halves the data rebuilding time and reduces the performance impact (max. IOPS) on the array controller from 15% to 5%; the data rebuilding bandwidth increases from 80 MB/s to 200 MB/s while the array controller's CPU utilization stays at 70%.

* These workloads, including garbage collection, will be available in the near future.

18
Comprehensive HA/DR Solutions

[Diagram: HyperMetro active-active deployment. Production storage at Site A and Site B serves an Oracle RAC / VMware vSphere / FusionSphere cluster over FC/IP networks, with synchronous mirroring between the sites and a quorum server at standby Site C. The layout can be extended to 3DC topologies that combine HyperMetro, synchronous, and asynchronous replication between sites A, B, and C in serial, parallel, and star arrangements over WAN links.]

19
More Robust Storage HA Cluster
[Diagram: nine failure scenarios (#1-#9) covering combinations of storage and witness/quorum faults, comparing how Solution A and OceanStor Dorado keep the HA cluster available.]
20
Extreme Performance Experience

21
Extreme Performance Experience
• DB acceleration: 57,000 vs 11,500 transactions per second (3.84 TB SSD x 40, SwingBench OE2 transaction generator)
• VM delivery: 1.5 minutes vs 52 minutes to clone 100 VMs of 50 GB each
• VDI support: 7,200 vs 1,500 VDI instances (3.84 TB SSD x 100 with data reduction)

OceanStor Dorado 6000 vs Solution A high-end AFA, dual controller


22
Consistent Performance Experience
[Chart: IOPS retained as advanced features are enabled step by step (RAID-TP/6/5 baseline, inline compression, inline dedup, HA storage cluster, GC + 80% pre-conditioning, snapshots). Test case: mixed workload, 8 KB, 7:3 read/write, 1 ms average latency, 8 LUNs, 32 outstanding I/Os. The compared systems retain 43.4% and 78.9% of baseline performance, respectively.]

23
End-to-End Load Balancing
Shared Front-end Adapter
• Requests from hosts can be evenly distributed over every front-end link.
• LUNs are shared by all controllers (no controller ownership).

Global Cache
• Write I/O requests for a single LUN can be placed into cache space on multiple storage controllers.
• For a better cache read-hit rate, a storage controller can place prefetched data in the global cache for potential read requests arriving on any front-end link.

Global Storage Pool
• The global storage pool can be accessed by multiple storage controllers.
• With RAID 2.0+, multiple LUNs are naturally distributed over multiple SSDs.

24
CPU Resource Dynamic Scheduling

• LUN Space Sharding: each LUN is sliced into multiple pieces (shards), and each shard is mapped to a specific CPU in a storage controller for the relevant I/O processing.
• CPU Core Grouping: CPU cores are divided into multiple groups, and each group is assigned a specific job (data switching, I/O read, I/O read & write, data flushing, data reduction).
• Dynamic Scheduling: higher-priority jobs can acquire more cores from the shared core groups.
• Workload Isolation: each CPU core processes its own I/O requests, avoiding lock contention between cores.

A sketch of this scheme follows.
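A minimal sketch of the sharding and core-grouping idea in Python. The slice size, group layout, and function names (SLICE_SIZE, CoreGroup, dispatch, rebalance) are illustrative assumptions, not Huawei's implementation.

```python
# Sketch: LUN-space sharding plus CPU core grouping with a shared core pool.
import zlib

SLICE_SIZE = 64 * 1024 * 1024  # assume each LUN is cut into 64 MB slices

class CoreGroup:
    def __init__(self, name, cores, priority):
        self.name, self.cores, self.priority = name, list(cores), priority

    def core_for(self, lun_id, slice_no):
        # Deterministic mapping over the group's current core list: requests for
        # one slice queue on one core, so the per-core queue needs no locking.
        key = f"{lun_id}:{slice_no}".encode()
        return self.cores[zlib.crc32(key) % len(self.cores)]

# Dedicated groups for fixed jobs, plus a shared pool that can be handed out.
groups = {
    "io_read_write": CoreGroup("io_read_write", range(0, 16), priority=0),
    "data_flush":    CoreGroup("data_flush",    range(16, 24), priority=1),
    "data_reduce":   CoreGroup("data_reduce",   range(24, 32), priority=2),
}
shared_pool = list(range(32, 48))

def rebalance(busy_group):
    """Dynamic scheduling: a high-priority, busy group borrows shared cores.
    (A real array would remap slices with consistent hashing; omitted here.)"""
    while shared_pool:
        groups[busy_group].cores.append(shared_pool.pop())

def dispatch(lun_id, offset, job="io_read_write"):
    slice_no = offset // SLICE_SIZE
    return groups[job].core_for(lun_id, slice_no)  # queue the request on this core

if __name__ == "__main__":
    print(dispatch(lun_id=7, offset=3 * SLICE_SIZE))  # stable owner core
    rebalance("io_read_write")                        # front-end load spike
    print(len(groups["io_read_write"].cores))         # 16 dedicated + 16 borrowed
```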

25
Powered by NVMe and RoCE
The Latest Protocol & Network Standard
• 50% of the latency reduction comes from the latest protocols (NVMe & RoCE v2).

Front-End/Back-End Adapter Protocol Offloading
• 10% of the latency reduction comes from:
  - The self-developed TOE frontend adapter chip
  - ASIC I/O balancing/distribution

Intelligent DAE and Self-Developed SSD
• Read-priority technology: read requests on SSDs are executed preferentially so hosts are answered in a timely manner; latency in mixed read/write scenarios is reduced by 20%.
• 30% performance improvement from SAS DAE connection multiplexing technology.

[Diagram: server connected over RoCE/FC to the engine's frontend adapters, storage controllers, backend adapters, and DAE; the path is annotated with 50 µs / 30 µs at the front end and 100 µs / 30 µs at the back end.]

26
Load Balancing - Core Demand of Mission Critical Application
Bank X Case Study

[Diagram: 24 database active logs (#1-#24) mapped onto a DKC (FED/VSD/BED) and its DKU.]

Database Log Switching
• The FSI customer ran DB2 for its core banking application; 24 active logs were used in circular mode, one log per hour.
• In each hour, only one active log was busy.

LUN Ownership
• The customer enabled "DB2 full logging" for potential problem analysis, so the workload became much higher than before.
• The storage then hit a performance issue, because the mapping between active logs and storage controllers was also fixed (LUN ownership).
• The I/O workload on a given storage controller could not be shared by other controllers, and the unbalanced workload led to a performance bottleneck.

27
Processor-Level Load Balancing

[Diagram: Solution B with two DKCs, each owning its own SSD enclosure (DKU), versus OceanStor Dorado with two engines sharing SSD enclosures at processor level.]

Solution B (DKC #0 / DKC #1)
• The owner controller has to take most of the workload.
• The 2nd engine cannot take part in load balancing.
• SSD enclosures owned by the 2nd engine cannot be shared with the 1st engine, so write I/O flushing is constrained within one engine.

OceanStor Dorado (Engine #0 / Engine #1)
• The workload can be spread across all controllers of the 1st and 2nd engines at processor level.
• SSD enclosures are shared between engines via the RDMA network, and data can be flushed from both engines.

28
Business Always-On

29
Business Always-On with Lower TCO
Cost
[Chart: cumulative cost over the years from initial purchase through upgrades and tech refreshes. The traditional solution requires full refreshes, while the Huawei solution uses FlashEver to replace only the controller module and to intermix various generations of DAE, with no downtime.]

Labor
[Chart: labor over the years. The traditional solution repeats deployment, optimization, provisioning, and data migration at each refresh; the Huawei solution needs no or less data migration and no re-cabling, leaving mainly provisioning work.]

30
Non-Disruptive Tech. Refresh

FlashEver Program
• Supports non-disruptive controller upgrades, including the next several generations over 10 years.
• Tech-refresh existing assets to gain the advantages of the latest technology.

Storage Federation
• Up to 128 controllers.
• Supports OceanStor Dorado and the following generations; different generations can be mixed in one federation cluster.
• Supports non-disruptive data mobility and online node reorganization.

SmartVirtualization
• Virtualizes third-party storage by taking over the access path.
• Reuses old storage to protect customer investment.
• Smoothly cuts the business over to OceanStor Dorado and the following generations.

31
FlashEver & Storage Federation Use Case

Storage Controller Tech-Refresh
• Replace only the existing storage controllers with next-generation controllers, without application downtime or data migration (DIP, Data-In-Place upgrade).

SSD Enclosure EOL
• After the controllers have been replaced (DIP), if SSD and disk enclosures reach end of life later, new enclosures can be added and the data migrated, also without application downtime.

Whole System Tech-Refresh
• Replace the whole storage system without application downtime; data migration is handled internally within a Storage Federation cluster.

32
Incomparable Flexibility for The Next Decade
Solution A / Solution B
• Storage federation: N/A.
• Mixed use of DAE generations: N/A.
• Controller-only refresh: N/A for Solution A; for Solution B it is only available for the VSP G1000 upgrade to VSP G1500, a temporary design.

OceanStor Dorado
• Storage Federation: manage and move data across multiple generations of arrays.
• Mixed use of DAE generations: eliminate data migration as much as possible to simplify capacity upgrades.
• FlashEver Program: replace only the old storage controller module, protecting the investment as much as possible.

[Chart: cumulative cost of the traditional solution (initial purchase, upgrades, tech refresh) vs. the Huawei solution (upgrades only) over the years.]

33
Wrap Up

Strong Capability of Intelligent Chip Development
• Array controller
• BMC
• Multi-protocol chip (FE/BE adapter)
• SSD controller
• AI chip

Symmetric A/A Storage Controller Architecture
• Shared frontend & backend adapters
• Fully meshed topology
• No LUN ownership
• Cross-engine load balancing

The Highest Level of Availability
• Tolerates any 2 of 8 controllers failing
• Tolerates 1 of 2 engines failing
• 1-second controller failover and NDU

Distributed Computing Design
• Intelligent DAE
• Frontend and backend adapter TOE engines

Comprehensive HA/DR Solutions
• A/A storage cluster
• Serial/parallel/star topology 3DC

Incomparable Flexibility
• FlashEver
• Storage Federation
• SmartVirtualization

34
Thank you.
Bring digital to every person, home, and organization for a fully connected, intelligent world.

Copyright©2018 Huawei Technologies Co., Ltd.


All Rights Reserved.

The information in this document may contain predictive


statements including, without limitation, statements regarding
the future financial and operating results, future product
portfolio, new technology, etc. There are a number of factors that
could cause actual results and developments to differ materially
from those expressed or implied in the predictive statements.
Therefore, such information is provided for reference purpose
only and constitutes neither an offer nor an acceptance. Huawei
may change the information at any time without notice.
OceanStor Dorado V6
Architecture and Key Technology

Security Level: Internal Only


Overview of OceanStor Dorado V6
Entry-Level   Mid-Range   High-End
Dorado18000 V6

Dorado8000 V6
Dorado6000 V6
Dorado3000 V6 Dorado5000 V6

Entry-Level Mid-Range High End


Type Dorado3000 V6 Dorado5000 V6 Dorado6000 V6 Dorado8000 V6 Dorado18000 V6
Height / Controllers of
2U/2C 2U/2C 2U/2C 4U/4C 4U/4C
each Engine
Controller Expansion 2-16 2-16 2-16 2-16 2-32
Maximum Disks 1200 1600 2400 3200 6400
Cache/Dual Controller 192G 256G/512G 512G/1024G 512G/1024G/2048G 512G/1024G/2048G
Front-end ports 8/16/32G FC, 1/10/25/40/100G Ethernet
Back-end ports SAS 3.0 SAS 3.0/100G Ethernet

2 Huawei Confidential
OceanStor Dorado V6: The Cutting Edge of Storage Innovation

3 Huawei Confidential
Hardware Design | Extremely Reliable | Extreme Performance | High Efficiency

4 Huawei Confidential
New Generation Innovative Hardware Platform
Extremely Reliable Extreme Performance Cost Effective

[Photos: rear and front panels of each enclosure.]

• High-end controller enclosure: 4U, 4 controllers per controller enclosure, 28 shared interface slots.
• Mid-range controller enclosure: 2U, 2 controllers per controller enclosure.
• Entry-level controller enclosure: 2U, 2 controllers per controller enclosure.
• Intelligent DAE: 2U, 2 controllers per enclosure; 36 NVMe SSDs (high density) or 25 SAS SSDs.

Standardization and high density.

5 Huawei Confidential


Extremely Reliable Extreme Performance Cost Effective

Controller design for high-end series

6 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Controller design for mid-range series

[Board layout: two controllers sharing a 2U enclosure, each with BBU, PSU, I/O cards, CPUs, onboard SSDs, IOB, 100GE interconnect, and fan modules.]

7 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Controller design for Entry-Level series & Intelligent DAE

[Board layout: two controllers sharing a 2U enclosure, each with BBU, PSU, I/O cards, CPUs, onboard SSDs, IOB, 100GE interconnect, and fan modules.]

8 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

2U, 36 disks, high capacity density

Traditional architecture design
1. The heat-dissipation window is small and wind resistance is high.
2. Double-sided connectors interfere with each other, limiting the number of disks; a horizontal backplane with more than 25 double-sided connectors cannot be staggered.

Dual horizontal orthogonal architecture design
1. The window area increases by 50%, and heat-dissipation capability increases by 25%.
2. Orthogonal connection without double-sided interference increases the number of disks by 44%.

A 2U integrated enclosure holds 36 palm SSDs, 44% more SSDs than the industry norm; 44% more disk slots fit within the width of a 19-inch cabinet.

Palm SSD form factor
• Traditional U.2 NVMe SSD: 100.6 x 14.8 x 70 mm, about 103 cm³.
• User-defined dual-port palm SSD: 160 x 9.5 x 79.8 mm, about 121 cm³.
• Same capacity, with the width reduced by 36%.

9 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Innovative Hardware Platform with self-developed chipsets


Network Chip Hi1822
• Lower network latency: 160 µs → 80 µs
• Intelligent failover between controllers
• Non-disruptive upgrade

CPU Chip Kunpeng 920
• No. 1 ARM-based CPU, 930+ SPECint
• Intelligent enclosure with an integrated CPU

SSD Chip Hi1812e
• 50% lower latency
• 20% longer life cycle

BMC Chip Hi1710
• 90% less fault recovery time (120 min -> 10 min)

AI Chip Ascend 310
• AI SoC for small-scale training

10 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Kunpeng 920, the best processor for storage

• High concurrency: up to 48 cores in one CPU.
• High integration: not only computing, but also acceleration engines, 8-channel DDR4, 100G RoCE & SAS 3.0, and PCIe 4.0 on the chip.
• High density: 4 sockets in 1U of space.

11 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Dorado V6 Design Principle - Distributed, Full-Mesh, and Global Shared Resource

[Diagram: hosts attach over FC / FC-NVMe / NVMe over Fabrics / iSCSI to controller enclosures built from shared frontends, storage controllers, and shared backends, which connect over NVMe over Fabrics (RoCE) to intelligent DAEs.]

Distributed Architecture
• A consistent distributed architecture across the high-end, mid-range, and entry-level series of Dorado V6.
• Symmetric active-active cluster (supports symmetric access by the hosts).
• Load balancing between all controllers, with auto-rebalancing upon scale-out, failover, and failback.

End-to-End NVMe
• Front end: NVMe over FC (32G) / NVMe over Fabrics (RoCE).
• Back end: NVMe SSDs / intelligent DAE.

Global Shared Resources
• Global cache and global storage pool for all LUNs.
• The high-end series supports shared frontend modules and shared backend modules (100GE RDMA).

12 Huawei Confidential
Hardware Design | Extremely Reliable | Extreme Performance | High Efficiency

13 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Introduction of Connectivity

• SmartMatrix technology runs over 100GE RDMA.
• Every physical host port connects to all four controllers in one engine through the shared front-end modules.
• All controllers in each engine (each with 48 cores) are interconnected in a full mesh.
• A shared back-end interconnection module connects the engines.
• One intelligent disk enclosure can be accessed by 8 controllers (2 engines) through the shared back-end module.

[Diagram: host I/O enters shared front-end ports A-D, crosses the full-mesh controller interconnect, and reaches the intelligent DAE through shared back-end ports.]

14 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Intelligent Front-end Connection - Controller Failover Is Transparent to the Host

[Diagram: the controller enclosure's shared front-end interface modules (FIMs) connect to the server over an FC network; behind the backplane, I/O can be routed to any controller.]

• The FIMs are linked to servers independently, regardless of controller failures.
• Generally, each I/O is directed to a specific storage controller through the backplane.
• If a controller is faulty, the corresponding I/O is redirected to another healthy controller, and the link between the FIM and the server is not interrupted.

15 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

High availability Architecture(HyperMetro-inner for High-end series)


Tolerance of 2 controllers failure simultaneously Tolerance of 1 engine failure Tolerance of 7 controllers failure

Shared Front-End Shared Front-End Shared Front-End Shared Front-End Shared Front-End Shared Front-End

A A’ A’’ A A’ A’’ A A’
B’ B B’’ B’ B B’’ B’ B
C C’ C’’ C C’ C’’ C C’
D’ D D’’ D’ D D’’ D’ D
E’’ E E’ E’’ E E’ E E’

F’’ F’ F F’’ F’ F F’ F

G’’ G G’ G’’ G G’ G G’

H’’ H’ H H’’ H’ H H’ H

A B C D E F G H A B C D E F G H A B C D E F G H
Shared Back-End Shared Back-End Shared Back-End Shared Back-End Shared Back-End Shared Back-End

Intelligent Intelligent Intelligent


DAE DAE DAE

• Global Cache supports 3 copies across two engines. • Global Cache supports 3 copies across two engines. • Global cache provides continuous mirroring technology
• Guarantee at least 1 cache copy available if 2 controllers failed • One disk enclosure can be accessed by 8 controllers(2 • Tolerates 7 controllers failure one by one of 8 controllers(2
simultaneously. engines) through the shared back-end module engines)
• Only one engine can also tolerate 2 controllers failure at the • Guarantee at least 1 cache copy available if one engine failed.
same time with 3 copies Global Cache
16 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

The best active-active design: ① 3-copy cache ② RoCE network ③ intelligent DAE

Vendor 1 (2-controller scale-out)
• The I/O interface belongs to one controller; a controller failure causes a link switchover.
• The disk enclosure is shared by a dual-controller pair; failure of both controllers (one engine) interrupts service.

Vendor 2 (4-8 controller scale-out)
• The I/O interface belongs to one controller; a controller failure causes a link switchover.
• The disk enclosure is shared by four controllers; failure of 4 controllers (one engine) interrupts service.

Huawei Dorado V6
• Shared front end: a controller failure is transparent to the host.
• The global cache provides continuous mirroring and 3 copies across 2 engines; disk enclosures are shared by 8 controllers.
• No service interruption when any 2 controllers fail at the same time, when 1 engine fails, or when 7 of 8 controllers (2 engines) fail one by one.

17 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Multi-level reliability technology combination

SmartMatrix

• Component reliability: global disk protection, a reliability first in the industry.
• Product reliability: RAID-TP tolerates 3 disks failing at the same time, a reliability first in the industry.
• Architecture reliability: SmartMatrix fully meshed architecture.
• Solution reliability: gateway-free active-active (A-A), keeping business continuity in the event of faults.

99.9999% high availability for the most demanding enterprise reliability needs.
18 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Self-developed SSD disk


• Global wear leveling and Huawei's patented global anti-wear leveling across the storage pool.
• RAID 4 is supported inside the SSDs to ensure data reliability.
• Dorado supports RAID 5/6/TP at the pool level, tolerating simultaneous failures of up to three disks.

19 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

RAID 2.0+

[Diagram: traditional RAID with LUN virtualization and dedicated hot spares vs. RAID 2.0+ with block virtualization and distributed hot-spare space.]

 Huawei RAID 2.0+ combines bottom-layer media virtualization with upper-layer resource virtualization for fast data reconstruction and smart resource allocation.
 Fast data reconstruction: reconstruction time is shortened from 5 hours to only 15 minutes, a 20-fold improvement, reducing adverse service impacts and disk failure rates.
 All disks in a storage pool participate in reconstruction, and only service data is reconstructed; the traditional many-to-one reconstruction mode becomes a many-to-many fast reconstruction mode.

20 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Dynamic Data Reconstruction

[Diagram: dynamic reconstruction shrinks a RAID group from N+M to (N-1)+M. At T0 a stripe spans D0-D3 plus parities P and Q across seven disks; after a disk fails, the surviving data is rewritten at T1/T2 as new, narrower full stripes (D0'-D2', P', Q') on the remaining disks.]

21 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Offload data reconstruction to intelligent disk enclosures

Without offloading, the controller reads all surviving data from the disks (23 pieces in a 23+2 RAID group), calculates the parity, and writes the reconstructed data into hot-spare space.

With offloading:
1. The controller triggers reconstruction on intelligent DAE A and DAE B.
2. Each DAE reads the data from its own disks (for example, 12 pieces in DAE A and 11 pieces in DAE B).
3. Each DAE calculates partial parity locally (P' and Q' in DAE A, P'' and Q'' in DAE B).
4. The DAEs transmit only P', Q', P'', and Q'' to the controller.
5. The controller reconstructs the data from P', Q', P'', and Q'' and writes it into hot-spare space.

The reconstruction bandwidth consumed per rebuilt disk in the 23+2 RAID group drops from 24x to 5x.

22 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Comprehensive HA/DR Solutions


HyperMetro
Gateway-free, Active-Active, Static + Dynamic Quorum Smooth upgrade to 3DC

[Diagram: HyperMetro active-active deployment. Production storage at Site A and Site B serves an Oracle RAC / VMware vSphere / FusionSphere cluster over FC/IP networks, with synchronous mirroring between the sites and a quorum server at standby Site C. The layout can be extended to 3DC topologies that combine HyperMetro, synchronous, and asynchronous replication between sites A, B, and C in serial, parallel, and ring arrangements over WAN links.]
23 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Efficient data protection based on RoW

① Clones can be created from read-only snapshots (for example, HyperClone_1 and HyperClone_2 from HyperCDP_0 / HyperSnap_0 taken at TP0 and TP1 of the source LUN).
② Snapshots can be rolled back freely; cascading and cross-level combinations of snapshots and clones are supported (for example, cascaded snapshots HyperSnap_1/HyperSnap_2 and cascaded clones HyperClone_3/HyperClone_4).

Remarks: TP0, TP1, and TP2 are examples; snapshots and clones can be associated with any point in time of the source LUN.

HyperCDP - data protection | HyperSnap - read & write | HyperClone - data mirror

24 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

HyperCDP - 60,000 protection points, once every 3 seconds

COW (copy-on-write)
1. Copy the old data from its original location to a new location.
2. Write the new data at the original location.
3. Modify the LUN mapping table and other metadata.
Cost per overwrite: 1 read, 2 writes, and 1 metadata update.

ROW (redirect-on-write)
1. Write the new data into a new physical location.
2. Modify the mapping table and add the old data block to the list of blocks to be released.
Cost per overwrite: 1 write and 1 metadata update.
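A toy model of the two write paths in Python makes the cost difference concrete; the class name, block map, and counters are illustrative, not a real on-disk layout.

```python
# Toy model contrasting copy-on-write (COW) and redirect-on-write (ROW).
class Volume:
    def __init__(self, nblocks):
        self.media = {}                                   # physical block -> data
        self.block_map = {i: i for i in range(nblocks)}   # logical -> physical
        self.next_free = nblocks                          # next free physical block
        self.to_release = []                              # old blocks awaiting GC (ROW)
        self.reads = self.writes = self.meta_updates = 0

    def write_cow(self, lba, data, snapshot_map):
        old_phys = self.block_map[lba]
        snapshot_map[lba] = self.next_free                # snapshot keeps the copy
        self.media[self.next_free] = self.media.get(old_phys)
        self.reads += 1; self.writes += 1                 # 1 read + 1 copy write
        self.next_free += 1
        self.media[old_phys] = data; self.writes += 1     # overwrite in place
        self.meta_updates += 1                            # update snapshot metadata

    def write_row(self, lba, data):
        new_phys = self.next_free; self.next_free += 1
        self.media[new_phys] = data; self.writes += 1     # single new write
        self.to_release.append(self.block_map[lba])       # old block queued for GC
        self.block_map[lba] = new_phys; self.meta_updates += 1

vol, snap = Volume(8), {}
vol.write_cow(3, b"new-A", snap)   # adds 1 read, 2 writes, 1 metadata update
vol.write_row(4, b"new-B")         # adds 0 reads, 1 write, 1 metadata update
print(vol.reads, vol.writes, vol.meta_updates)   # -> 1 3 2
```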
25 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Non-Disruptive Upgrading (NDU)

No business impact, no performance loss


 94% of components run in user mode and upgrade in 1 second, with no host connection loss, by switching over the globally shared frontend card.

 6% of components sit in the stable OS kernel and upgrade with a reboot within minutes.

[Diagram: user-mode service modules (protocol, data, control, management, inter-communication) running on top of a stable OS kernel; each module switches over in 1 second.]

26 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Online upgrading without business disruption


[Diagram: host I/O enters through the HiSilicon SmartIO Hi1822 front-end interface and is handled by the I/O processing process alongside the system management, device management, configuration, and UI/CLI management processes.]

① User-mode upgrading: no system reboot.
② The self-designed chips hold I/Os and maintain host link information during the switchover.
③ The I/O processing process is upgraded.
④ Host I/O recovers.

 Self-designed chips hold I/Os and keep links alive.
 The I/O processing process upgrade takes less than 1 second.
 Host links stay online during the upgrade.
 Single-link firmware upgrade is supported.

Modular software architecture design: each component upgrades in 1 second.

27 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Summary: Multi-level reliability technology combination

Solution-level reliability: 99.99999%
 HyperMetro: gateway-free active-active solution (1 ms latency)
 FlashEver without data migration
 HyperCDP, HyperSnap, HyperClone, HyperReplication, 3DC

System-level reliability: 99.9999%
 RAID-TP: tolerates the simultaneous failure of 3 disks
 RAID 2.0+, dynamic reconstruction
 End-to-end DIF
 Intelligent DAE
 Non-disruptive upgrade

Architecture
 SmartMatrix fully interconnected architecture tolerates the failure of 7 out of 8 controllers

Component - Ever Solid SSD
 Global wear leveling
 Huawei-patented global anti-wear leveling
 Intelligent I/O module
 Built-in dynamic RAID utilizes space
28 Huawei Confidential
Hardware Design | Extremely Reliable | Extreme Performance | High Efficiency

29 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

End-to-End Symmetric Architecture

Symmetric interface
• All series support active-active access by hosts; requests can be evenly distributed over every frontend link.
• LUNs in all series have no owning controller, simplifying use and load balancing (LUNs are divided into slices, and the slices are distributed evenly across all live controllers using a DHT algorithm).
• The high-end series provides shared, intelligent frontend I/O modules that divide LUNs into slices and send each request to its target controller, reducing latency.

Global Cache
• I/Os of a LUN (falling into one or more slices) can be written to the cache of any controller and then acknowledged to the host.
• The intelligent read cache of all controllers can prefetch any LUN's data and metadata to improve cache hits.

Global Pool
• The storage pool spreads across all controllers and uses all the SSDs connected to them to store every LUN's data and metadata with RAID 2.0+.

30 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

FLASHLINK®: Intelligent Algorithm for Dynamic CPU Resource Scheduling

• LUN slicing: a LUN is divided into multiple slices (also called shards), and each slice is mapped to specific CPUs in a storage controller for I/O processing.
• CPU core grouping: the CPU cores are divided into multiple groups, and each group is assigned a specific job (data exchange, I/O read, I/O read/write, data flushing, data reduction).
• Dynamic priority scheduling: high-priority jobs obtain more cores from the shared CPU core groups.
• Service load isolation: each CPU core only processes its own I/O requests, avoiding mutual locking.

31 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

How does Distributed Architecture work

Mid-range/entry-level with native multipath
• The host's native multipath provides active-active access (round-robin, etc.).
• An embedded router in the front-end I/O module divides LUNs into slices and distributes each I/O to its target controller.

Mid-range/entry-level with Huawei UltraPath
• UltraPath divides LUNs into slices on the host and sends each I/O directly to its target controller.

High-end with native multipath
• The host's native multipath provides active-active access (round-robin, etc.).
• The shared front end divides LUNs into slices and distributes each I/O to its target controller.

Below the front end, all configurations share the same layers: multi-processor dynamic resource allocation, DHT, global cache, and the global pool (RAID 2.0+, FlashLink 2.0).

32 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Performance Express Supported by E2E NVMe and RoCE

Self-developed ASIC interface module
• Offloads the FCP and NoF protocol stacks.
• Higher rates: 32 Gbit/s (FC) / 100 Gbit/s (Ethernet).
• The chip responds to the host directly, reducing the number of I/O interactions.
• ASIC I/O balancing/distribution.
• Multi-queue and polling, lock-free.

Self-developed ASIC SSD and intelligent disk enclosure
• Read-priority technology: read requests on SSDs are executed preferentially so that hosts are answered in a timely manner.
• The intelligent disk enclosure has its own CPU, memory, and hardware acceleration engine; data reconstruction is offloaded to it to reduce latency.
• Multi-queue and polling, lock-free.

[Diagram: host writes and reads enter over 32G FC / 100G RoCE through the shared frontends, storage controllers, and shared backends to the intelligent DAEs over 100G RoCE; the path is annotated with 50 µs / 30 µs at the front end and 100 µs at the back end.]

33 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Nearly all software components are in user-mode


Reducing latency caused by interactions between kernel mode and user mode

Traditional design (fewer user-mode components, more interaction)
• User space: OMM, space management, value-added features.
• Kernel space: driver, disk management, pool management.
• User mode and kernel mode call each other constantly, causing high latency.

Dorado V6 (full user mode, less interaction)
• User space: OMM, space management, value-added features, NVMe driver, disk management, pool management.
• Kernel space: SAS driver only.
• Interactions between the two modes are reduced, giving low latency.

34 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Intelligent NIC optimization: Traditional NIC -> TOE -> DTOE


Traditional NIC
• The NIC implements only PHY/MAC; IP, TCP, socket, and driver run in the kernel.
• Challenge: every data packet triggers an interrupt, so CPU resource consumption is severe.

TOE (TCP offload engine)
• The NIC implements PHY/MAC/IP/TCP with its own buffer; socket and driver remain in the kernel.
• Advantage: each application can finish a complete data processing pass before triggering an interrupt, significantly reducing the server's interrupt load.
• Challenge: kernel-mode interrupts, locks, system calls, and thread switching still add high latency.

DTOE (direct TCP offload engine)
1. Transport-layer processing moves into the microcode of the Huawei customized Hi1822 network card.
2. The storage application software is optimized to fit the new architecture.
3. Data goes from the link layer directly into application memory.
4. The kernel is bypassed, significantly reducing latency.
35 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

FlashLink®: Intelligent Disk Enclosure & Collaboration of Chipsets


System controller: data read/write, cache flushing, and advanced features.
Intelligent DAE: *garbage collection, *compression, and data reconstruction.

Offloading workloads from the controllers to the intelligent DAEs:
• Improves system performance by 30%.
• Improves reconstruction speed by 100%.
• Lowers the performance impact of reconstruction on the business from 15% to 5%.

[Diagram: the system controller (front-end interfaces, DIMMs, CPUs, PCIe) connects over RoCE to the intelligent DAE (CPUs, DIMMs, PCIe) and to proprietary SSD chips.]

36 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

FlashLink®: Smooth GC with Multi-stream to reduce WA by 60%

[Chart: with standard placement, garbage collection must relocate live hot and cold data out of partially valid physical blocks; with multi-stream placement, whole blocks can be reclaimed with no garbage movement. Write amplification is reduced by more than 60%, and SSD life cycle roughly doubles.]

37 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

FlashLink®: Smooth GC with Multi-stream (cont.)


Hot block list
• Metadata changes frequently, so metadata flows are assigned to the same block list. This reduces the amount of data moved during GC and improves GC efficiency.

Warm block list
• User data changes less frequently, and user-data flows are likewise assigned to their own block list, so data to be moved can be located sooner during GC.

Cold block list
• User data that has remained unchanged for a long time is unlikely to change in the future. Such flows are assigned to their own block list as well, so fewer blocks need to be scanned during GC.

ROW writes and TRIM feed these lists, and a global GC process works across them. A small routing sketch follows.
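A minimal sketch of that stream routing in Python; the stream names, age threshold, and list structures are assumptions for illustration, not product values.

```python
# Route blocks to hot/warm/cold lists so GC scans fewer mixed blocks.
import time

BLOCK_LISTS = {"hot": [], "warm": [], "cold": []}
COLD_AGE_SECONDS = 7 * 24 * 3600          # assume "cold" = untouched for a week

def choose_stream(is_metadata, last_modified):
    """Metadata changes often -> hot; ordinary user data -> warm;
    long-unchanged user data -> cold."""
    if is_metadata:
        return "hot"
    if time.time() - last_modified > COLD_AGE_SECONDS:
        return "cold"
    return "warm"

def place(block, is_metadata=False, last_modified=None):
    stream = choose_stream(is_metadata, last_modified or time.time())
    BLOCK_LISTS[stream].append(block)
    return stream

# Blocks in one list age at a similar rate, so whole physical blocks tend to
# become fully invalid together and can be reclaimed without moving data.
print(place({"id": 1}, is_metadata=True))                         # -> hot
print(place({"id": 2}))                                           # -> warm
print(place({"id": 3}, last_modified=time.time() - 30 * 86400))   # -> cold
```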

38 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Global Cache with RDMA & WAL


Write latency: 95 µs with a traditional cache vs 50 µs on Dorado V6.

Write-ahead log (WAL): incoming 4 KB/8 KB writes for LUN0-LUN2 are appended as log records (A, B, C, D, E, ...) into a linear cache space.

Global memory virtual address space: the linear log space is mapped (AddrN1, AddrN2, AddrN3, ...) onto the memories of controllers A-D, so a record can be placed in the cache of any controller over RDMA.

A sketch of the idea follows.
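A simplified single-process model of that write-ahead log over a global address space, in Python; the slot size, the 2-copy policy, and the class names are assumptions, and the `write` method merely stands in for an RDMA transfer.

```python
# Simplified write-ahead log (WAL) over a "global" address space.
SLOT = 8 * 1024   # assume fixed 8 KB log slots

class GlobalAddressSpace:
    """Maps a linear slot number to (controller, local offset)."""
    def __init__(self, controllers, per_ctrl_slots):
        self.mem = {c: bytearray(per_ctrl_slots * SLOT) for c in controllers}
        self.per_ctrl_slots = per_ctrl_slots

    def resolve(self, slot_no):
        ctrls = sorted(self.mem)
        ctrl = ctrls[slot_no % len(ctrls)]                    # stripe over controllers
        offset = (slot_no // len(ctrls)) % self.per_ctrl_slots * SLOT
        return ctrl, offset

    def write(self, ctrl, offset, payload):                   # stand-in for RDMA write
        self.mem[ctrl][offset:offset + len(payload)] = payload

class WriteAheadLog:
    def __init__(self, space, copies=2):
        self.space, self.copies, self.tail = space, copies, 0

    def append(self, record: bytes) -> int:
        assert len(record) <= SLOT
        slot = self.tail
        # Consecutive slots land on different controllers, so the record and its
        # mirror copy sit on separate controllers before the host is acknowledged.
        for i in range(self.copies):
            ctrl, off = self.space.resolve(slot + i)
            self.space.write(ctrl, off, record)
        self.tail += self.copies
        return slot

gas = GlobalAddressSpace(["A", "B", "C", "D"], per_ctrl_slots=1024)
wal = WriteAheadLog(gas)
print(wal.append(b"LUN0:lba42:payload"))   # -> 0, acknowledged after both copies land
```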

39 Huawei Confidential
Extreme Reliable Extreme Performance High Efficiency

Full stripe writing in RoW

Same performance across different RAID levels: on Dorado, KIOPS stays roughly constant for RAID 5, RAID 6, and RAID-TP, whereas a traditional array loses throughput as the parity count grows.

Variable-size host writes (e.g., 4 KB, 3 KB, and 7 KB blocks from LUN0-LUN2) are aggregated by ROW into a log structure and written out as one full stripe (A B C D E + P + Q) per CKG.

Traditional way (read-modify-write):
Configuration | Extra Reads | Extra Writes | Total I/Os (extra)
RAID-5        | 2           | 1            | 4 (3)
RAID-6        | 3           | 2            | 6 (5)
RAID-TP       | 4           | 3            | 8 (7)

Dorado way (full-stripe write):
Configuration | Extra Reads | Extra Writes | Total I/Os
RAID-5        | 0           | 0            | 1
RAID-6        | 0           | 0            | 1
RAID-TP       | 0           | 0            | 1
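A quick arithmetic check of those I/O counts in Python, comparing a classic read-modify-write update of one data block against a log-structured full-stripe write; the helper names are illustrative.

```python
# Small-write penalty: read-modify-write vs. a RoW full-stripe write.
def rmw_ios(parity_disks: int) -> dict:
    """Update one data block in an existing stripe with m parity disks:
    read the old data block and old parities, then write new data and parities."""
    extra_reads = 1 + parity_disks          # old data block + old parity blocks
    extra_writes = parity_disks             # updated parity blocks
    total = extra_reads + extra_writes + 1  # +1 for the new data block itself
    return {"extra_reads": extra_reads, "extra_writes": extra_writes, "total": total}

def full_stripe_ios() -> dict:
    """RoW aggregates new data into a complete stripe, so parity is computed
    in memory and the whole stripe goes out as a single sequential write."""
    return {"extra_reads": 0, "extra_writes": 0, "total": 1}

for level, m in [("RAID-5", 1), ("RAID-6", 2), ("RAID-TP", 3)]:
    print(level, rmw_ios(m), "vs full stripe", full_stripe_ios())
# RAID-5: 2 extra reads, 1 extra write, 4 total; RAID-6: 3/2/6; RAID-TP: 4/3/8.
```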
40 Huawei Confidential
Hardware Design | Extremely Reliable | Extreme Performance | High Efficiency

41 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Deduplication and compression: inline, Background, Fixed length, Variable length


Data write flow:
1. Pattern recognition: pattern matching is done in the cache, and hot fingerprints are kept in the memory cache to increase the reduction rate and performance.
2. Fingerprint calculation: a similar fingerprint (SFP) and a weak hash fingerprint (FP) are calculated simultaneously.
3. Inline deduplication: look up the hot fingerprint table; on a hit, deduplicate online, return the fingerprint, and increase the reference count.
4. Compression & compaction: the online compression algorithm is selected adaptively according to the load.
5. The fingerprint information of new data is saved to the opportunity table.
6. Background fixed-length or similar deduplication: when the SFPs match, background deduplication is triggered; if the full fingerprints are identical, the fixed-length deduplication process runs, otherwise the variable-length (similar) deduplication process runs, and the physical address is updated.

A simplified sketch of the inline part of this path follows.
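A stripped-down Python sketch of the inline decision path described above; hashlib.sha1 and zlib.adler32 stand in for the array's real fingerprints, and the hot/opportunity tables are plain in-memory dicts, so this is an illustration rather than the FlashLink implementation.

```python
# Minimal inline deduplication decision path.
import hashlib
import zlib

hot_fingerprints = {}     # fingerprint -> (physical block, refcount)
opportunity_table = {}    # fingerprint -> SFP hint for background dedup

def similar_fingerprint(block: bytes) -> int:
    # Placeholder SFP: checksum of a sampled prefix so near-identical blocks
    # tend to collide; the real SFP algorithm is not public.
    return zlib.adler32(block[:512])

def write_block(block: bytes, allocate):
    fp = hashlib.sha1(block).hexdigest()
    if fp in hot_fingerprints:                           # exact hit: inline dedup
        phys, refs = hot_fingerprints[fp]
        hot_fingerprints[fp] = (phys, refs + 1)
        return phys, "deduped"
    phys = allocate(block)                               # new data: store it
    hot_fingerprints[fp] = (phys, 1)
    opportunity_table[fp] = similar_fingerprint(block)   # candidate for background pass
    return phys, "stored"

# Tiny usage example with a fake allocator.
_store = []
def allocate(block):
    _store.append(block)
    return len(_store) - 1

print(write_block(b"A" * 8192, allocate))   # (0, 'stored')
print(write_block(b"A" * 8192, allocate))   # same fingerprint -> (0, 'deduped')
```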

42 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Similar Deduplication VS Variable/Fixed length Deduplication

Data reduction effect:
• 50% increase compared with fixed-length deduplication (512 B vs 64 B)
• 30% increase compared with variable-length deduplication

Scenario coverage compared across fixed-length, variable-length, and similar deduplication:
• Exactly the same
• Partially identical, sector offset
• Partially identical, byte offset
• Partially identical, differences anywhere
43 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

The principle of similar deduplication

Similar Fingerprint (SFP)
Variable-length deduplication relies mainly on similar fingerprints (SFPs). For a set of data blocks that are not identical but similar, the same SFP can be calculated; multiple similar fingerprints are calculated for each data section.

Step 1: Calculate the SFP for each original block and find the blocks that share the same SFP (for example, blocks 1-4 all yield SFP1 and are grouped for variable-length deduplication).
Step 2: Put the similar blocks together and select a reference block (for example, data #1).
Step 3: Mark and store only the parts of the other blocks (data #2-#4) that differ from the reference block.

An illustrative sketch of this grouping follows.
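The grouping-and-delta idea can be illustrated with a generic sampled-checksum SFP and a byte-level diff in Python; these stand-ins are not Huawei's actual SFP or delta encoding.

```python
# Illustrative similar-deduplication pass: group blocks by an SFP, pick a
# reference block per group, and keep only byte-level differences.
import zlib
from collections import defaultdict

def sfp(block: bytes, samples=64) -> int:
    # Checksum of evenly sampled bytes: small edits leave most samples intact,
    # so similar blocks often share the same SFP.  (Assumed scheme.)
    step = max(1, len(block) // samples)
    return zlib.adler32(block[::step])

def delta(reference: bytes, block: bytes):
    """Record (offset, new_byte) pairs where the block differs from the reference."""
    return [(i, b) for i, (a, b) in enumerate(zip(reference, block)) if a != b]

def similar_dedupe(blocks):
    groups = defaultdict(list)
    for blk in blocks:                         # step 1: group by SFP
        groups[sfp(blk)].append(blk)
    stored = []
    for same_sfp in groups.values():
        ref = same_sfp[0]                      # step 2: pick a reference block
        stored.append(("ref", ref))
        for other in same_sfp[1:]:             # step 3: keep only the differences
            stored.append(("delta", delta(ref, other)))
    return stored

base = bytes(8192)                              # 8 KB of zero bytes
edited = bytearray(base); edited[100:103] = b"abc"
print([kind for kind, _ in similar_dedupe([base, bytes(edited)])])  # ['ref', 'delta']
```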

44 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Fuzzy Matching Increases Data Reduction Rate by 25%


Pipeline: identify similar data → extract reference fingerprints → assemble similar data → dedupe & compress.

[Figure: bit patterns of candidate blocks are scored against a reference (similarity scores such as 0.9, 0.8, 0.7, 0.3), and high-scoring blocks are assembled for joint reduction.]

25% higher image-data reduction rate.
45 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Inline Compaction: Data Compaction By Byte

Input 8 KB blocks are compressed, and the compressed blocks (B1-B5) are packed into a 4 MB chunk.

• Original data layout: each compressed block is padded to an alignment boundary, wasting space (the example layout ends at 10 KB).
• Optimized data layout: byte-level compaction packs B1-B5 back to back and records only offset metadata (the example layout ends at 6 KB); wasted space is less than 1%.

An illustrative packing sketch follows.
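A Python sketch of byte-level compaction into a large chunk; the 4 MB chunk size matches the slide, while the class and method names are illustrative.

```python
# Byte-level compaction: compressed blocks are packed back to back into a
# 4 MB chunk, and (offset, length) metadata is enough to locate each block.
import zlib

CHUNK_SIZE = 4 * 1024 * 1024   # 4 MB chunk, as on the slide

class Chunk:
    def __init__(self):
        self.buf = bytearray()
        self.index = {}          # block id -> (offset, length)

    def add(self, block_id, data: bytes) -> bool:
        packed = zlib.compress(data)
        if len(self.buf) + len(packed) > CHUNK_SIZE:
            return False                     # chunk full; caller opens a new one
        self.index[block_id] = (len(self.buf), len(packed))
        self.buf += packed                   # no padding to sector boundaries
        return True

    def read(self, block_id) -> bytes:
        off, length = self.index[block_id]
        return zlib.decompress(bytes(self.buf[off:off + length]))

chunk = Chunk()
chunk.add("b1", b"A" * 8192)
chunk.add("b2", b"B" * 8192)
print(chunk.read("b2") == b"B" * 8192, len(chunk.buf))   # True, far below 16 KB
```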

46 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

FlashEver: Non-Disruptive Hardware Upgrade

Controller Upgrade
• Supports non-disruptive controller upgrades (Dorado V6 to Dorado V7 new hardware), including the next generations over 10 years.

Capacity Consolidation
• Uses the latest media and larger-capacity drives (for example, a 350 TB DAE replaced by a 576 TB new DAE) to replace old drives, saving physical space and making capacity more flexible.

Storage Federation
• Up to 128 controllers across multiple Dorado systems; V6 and the following generations can form one federation cluster, including non-disruptive data mobility and node join/unjoin.

SmartMigration + SmartVirtualization
• Takes over the paths of third-party storage to leverage the old storage system and smoothly cuts the business over to Dorado V6 and the following generations.

47 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Controller Upgrading with a New Generation

Step 1: In the multi-controller system, services on controller A are switched to the other controllers.
Step 2: Remove controller A and replace it with a new-generation (Dorado V7/V8) controller.
Step 3: Switch services back to the new controller A.
Step 4: Repeat steps 1-3 until all controllers are replaced with new-generation controllers.
48 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Capacity Consolidation / Disk Enclosure Change

• Step 1: Connect a DAE with new-technology drives such as SCM or QLC.
• Step 2: Copy the data internally within the one storage system, with no performance loss.
• Step 3: Cut the business over, then remove and recycle the legacy DAE.

49 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Storage Federation Cluster

LUN data can be moved within a cluster of Dorado V6, V7, and V8 systems over the federation cluster (data) network.

• Step 1: Build a cluster network between the Dorado systems; up to 128 controllers are supported.
• Step 2: Join and unjoin operations add or remove the different Dorado systems in the cluster.
• Step 3: Data is moved between the systems internally and non-disruptively.

50 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

SmartMigration + SmartVirtualization

SmartMigration (included in the basic software bundle)
• Step 1: Take over the legacy storage resources (LUNs).
• Step 2: The Huawei storage creates eDevLUNs, virtual LUNs that inherit the WWNs of the original LUNs.
• Step 3: Allocate internal resources and start the data migration internally; there is no interruption at this stage.

[Diagram: the application server's original paths through the switches to the legacy storage arrays are taken over by Huawei OceanStor Dorado V6, which maps each eDevLUN to a target LUN and serves I/O over the new path.]

51 Huawei Confidential
Take-away

52 Huawei Confidential
Thank you.
Bring digital to every person, home, and organization for a fully connected, intelligent world.

Copyright©2018 Huawei Technologies Co., Ltd.


All Rights Reserved.

The information in this document may contain predictive


statements including, without limitation, statements regarding
the future financial and operating results, future product
portfolio, new technology, etc. There are a number of factors that
could cause actual results and developments to differ materially
from those expressed or implied in the predictive statements.
Therefore, such information is provided for reference purpose
only and constitutes neither an offer nor an acceptance. Huawei
may change the information at any time without notice.

Jack Lyu, lvdejian@huawei.com


Extremely Reliable Extreme Performance Cost Effective

What’s NVMe?

SAS: designed for disk. NVMe: designed for flash/SCM.

[Diagram: with SAS, CPU cores reach SSDs/HDDs through SAS controllers; with NVMe, CPU cores talk to the SSDs directly, with no SAS controller in the path.]

54 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

NVMe Reduces Protocol Processing Latency

Reduced interactions: communication interactions are reduced from 4 to 2, lowering latency.

SAS protocol stack (app → block layer → SCSI initiator → SAS → target), write sequence:
1. Transfer command
2. Ready to transfer
3. Transfer data
4. Response feedback

NVMe protocol stack (app → block layer → NVMe), write sequence:
1. NVMe write command
2. NVMe write finished

NVMe provides a lower average storage latency than SAS 3.0.

55 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

NVMe Concurrent Queues and Lock-Free Processing

• SAS: each controller has a single queue to each SSD, shared by all CPU cores; locks ensure exclusive access by the cores. The number of queues per controller equals the number of disks, e.g., 25 for a Dorado 5000 SAS with 25 SSDs.
• NVMe: every CPU core has an exclusive, lock-free queue on each SSD. The number of queues per controller equals the number of disks multiplied by the number of CPU cores processing back-end I/O, e.g., 288 for a Dorado 5000 NVMe with 36 SSDs and 8 back-end cores (cores 0 to N, N = 7).
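The queue counts follow directly from that formula; a quick check in Python with the slide's figures (the function names are just for illustration):

```python
# Queue-count formula from this slide.
def sas_queue_count(num_disks: int) -> int:
    # SAS: one shared, lock-protected queue per SSD per controller.
    return num_disks

def nvme_queue_count(num_disks: int, backend_cores: int) -> int:
    # NVMe: every back-end CPU core gets its own lock-free queue on every SSD.
    return num_disks * backend_cores

print(sas_queue_count(25))        # 25  (Dorado 5000 SAS, 25 SSDs)
print(nvme_queue_count(36, 8))    # 288 (Dorado 5000 NVMe, 36 SSDs, cores 0..7)
```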
56 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

NVMe architecture in Storage

57 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

What’s RoCE

RDMA supports zero-copy networking


by enabling the network adapter to
transfer data from the wire directly to
application memory or from application
memory directly to the wire,
eliminating the need to copy data
between application memory and the
data buffers in the operating system.
Such transfers require no work to be
done by CPUs, caches, or context
switches, and transfers continue in
parallel with other system operations.
This reduces latency in message transfer.
-- https://en.wikipedia.org/wiki/Remote_direct_memory_access
58 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

PCIe vs NVMe-oF

• PCIe: at most 256 devices on the PCIe bus in total, and no more than about 100 SSDs; the bus can be shared by 2 controllers; it is a data channel with no DMA engine, so it is CPU dependent.
• NVMe-oF (RoCE): no limit on the number of SSDs; enclosures can be shared by 8 controllers, or even 32 controllers with a switch; DMA is enabled, so it is CPU independent.

59 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

Intelligent CPU Partition Scheduling Algorithm - Reducing Latency by 30%

CPU sockets in controllers
• Data is addressed in 64 MB (LUN, LBA) granules and mapped through a DHT ring (nodes N1-N7) to CPU sockets in the controllers.

Core grouping in a CPU
• Cores are grouped by job: I/O read/write and the data switching channel use dedicated core groups, while protocol parsing and data flushing use shared groups.

Core-based resource isolation
• Individual I/O reads and writes are pinned to separate cores (I/O read/write grouping).

A consistent-hashing sketch of the placement follows.
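A consistent-hashing sketch of the (LUN, LBA)-to-node mapping shown above, in Python; the 64 MB grain, the virtual-node count, and the node names are assumed values.

```python
# DHT placement: hash (LUN, LBA) at a fixed grain onto a ring of nodes.
import bisect
import hashlib

GRAIN = 64 * 1024 * 1024    # 64 MB (LUN, LBA) granularity, as on the slide

class DHTRing:
    def __init__(self, nodes, vnodes=32):
        # Each node appears vnodes times on the ring to even out the load.
        self.ring = sorted((self._h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))
        self.keys = [k for k, _ in self.ring]

    @staticmethod
    def _h(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def locate(self, lun_id: int, lba: int) -> str:
        grain = lba // GRAIN
        key = self._h(f"{lun_id}:{grain}")
        i = bisect.bisect(self.keys, key) % len(self.keys)   # walk clockwise on the ring
        return self.ring[i][1]

ring = DHTRing([f"N{i}" for i in range(1, 8)])               # N1..N7, as in the figure
print(ring.locate(lun_id=3, lba=5 * GRAIN + 4096))           # stable owner node
```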

60 Huawei Confidential
Extremely Reliable Extreme Performance Cost Effective

CPU multi-core load balancing optimization: no grouping -> grouping -> grouping + intelligent scheduling

Traditional (no grouping)
• Different tasks compete for time slices on the same CPU cores, so I/O data is frequently copied between cores, resulting in high latency.

Dorado V3 (core grouping)
• Cores are grouped by job (e.g., I/O vs. mirror), avoiding interference and frequent resource switching.
• Challenge: with uneven service mixes, some cores become overloaded and latency rises.

Dorado V6 (grouping + intelligent scheduling)
• According to the load status, an intelligent scheduler dispatches tasks to other cores to achieve load balancing.

61 Huawei Confidential
[Diagram: the data protection and mobility features HyperSnap, HyperClone, HyperReplication/A, HyperReplication/S, SmartMigration, HyperMetro, and HyperCDP, built on shared mechanisms such as DCL, time points, and dual write.]
62 Huawei Confidential
Huawei HyperMetro-based Active-Active
Data Centers Disaster Recovery Solution

Security Level:
Contents

1 Disaster Recovery Overview

Huawei HyperMetro-based Active-Active Data


2 Centers Disaster Recovery Solution

3 Competitive Analysis

4 Key Technologies

2 Huawei Confidential
Importance of Service Continuity for IT Systems

Loss caused by system downtime per hour

[Bar chart: loss per hour of downtime by industry (Media, Healthcare, Retail, Manufacturing, Telecom, Energy, Finance), ranging from 9 to 648 in units of 10,000 USD. Typical causes include fire, device faults, power outages, and virus or hacker attacks.]

Source: Network Computing, the Meta Group and Contingency Planning Research

3 Huawei Confidential
International Standards for Disaster Recovery Construction

Recovery Point Objective (RPO): amount of lost data caused by downtime Recovery Time Objective (RTO): downtime

International standards | RPO | RTO | Disaster recovery solutions
Tier 7: Highly automated, business-integrated solution | 0 | < 15 minutes | Application-level active-active data centers disaster recovery solution
Tier 6: Zero or little data loss | 0 | < 2 hours | Application-level disaster recovery, CDM, backup, and private cloud solutions
Tier 5: Transaction integrity | < 1 hour | 2 to 4 hours | Application-level active-passive, CDM, and backup solutions
Tier 4: Point-in-time copies | Several hours | 4 to 12 hours | Backup solution
Tier 3: Electronic vaulting | < 24 hours | 12 to 24 hours | Backup solution
Tier 2: Data backup with hot site | 24 hours to days | 24 hours to days | Backup solution
Tier 1: Data backup with no hot site | Days | Days | Backup solution
Tier 0: No off-site data | - | - | -

Source: SHARE's seven tiers of disaster recovery, released in 1992, were updated by IBM in 2012 into an eight-tier model.
4 Huawei Confidential
Active-Active Data Centers Disaster Recovery Solution Ensures
24/7 Service Continuity
Diagram: evolution across the tiers of disaster recovery (1-7) — from a single data center with a local backup solution, to active-passive data centers (Site 1 and Site 2), to active-active data centers (Site 1 and Site 2, both running FusionSphere) with an active-active disaster recovery solution.
5 Huawei Confidential
Contents

1 Disaster Recovery Overview

Huawei HyperMetro-based Active-Active Data


2 Centers Disaster Recovery Solution

3 Competitive Analysis

4 Key Technologies

6 Huawei Confidential
Definition of Active-Active Storage
Definition
An active-active storage solution consists of two storage systems which provide two consistent data copies in real
time. The two data copies can be concurrently accessed by one host. The failure of any copy does not affect
services. The two storage systems can be deployed in two data centers to form an active-active data centers
solution with the active-active design of upper-layer applications (as well as the network layer).
Six key elements
1. Independent storage systems: An active-active relationship is established between two storage systems, and
both storage systems have independent hardware and software.
2. Active-active access: Two data copies are both in the active state (not active-passive mode) and can be
accessed by a host concurrently.
3. Convergence of SAN and NAS: SAN and NAS services can be deployed on the same device.
4. Dual arbitration modes: The active-active data centers disaster recovery solution uses an independent third-
place arbitration mechanism and supports static priority mode and quorum server mode. If the third-place
quorum server fails, the system automatically switches to the static priority mode.
5. Real-time data synchronization: Data is synchronized between the two data centers in real time and services are
automatically switched over in the event of a disaster, ensuring zero RPO and enabling the RTO to be
approximately equal to 0.
6. Smooth expansion: The active-active data centers can be further expanded to the geo-redundant layout with
three data centers.
7 Huawei Confidential
End-to-End Physical Architecture

Diagram: end-to-end physical architecture between data center A and data center B (≤ 100 km apart, connected by raw optical fiber at the DC outlets).
 Active-active network layer: core, aggregation, and access layers with highly reliable, optimized L2 interconnection and optimal access paths.
 Active-active application layer: cross-DC high availability, load balancing, and migration scheduling of Oracle RAC, VMware, and FusionSphere.
 Active-active storage layer: active-active access with zero data loss.
8 Huawei Confidential
Convergence of SAN and NAS
Diagram: a host application cluster in each data center connects over FC or IP to production storage providing both SAN and NAS; write I/Os are mirrored between the two storage systems over FC or IP, and both systems connect over IP to a quorum server.

Working principle
One storage system is deployed in data center A and the other in data center B. Both storage systems deliver read and write services in active-active mode. Write I/Os are mirrored in real time between the two storage systems to ensure data consistency. On-demand configuration of SAN and NAS services and high availability capabilities are achieved at the storage layer. Active-active services are available across data centers by cooperating with hosts and networks at the application layer. No data is lost if either storage system fails.

Highlights
 Active-active, RPO = 0, RTO ≈ 0
 Gateway-free configuration; both file and database services deployed on the same device
 Quorum server shared by SAN and NAS services, ensuring that either data center can provide services and keep data consistent in the event of a link failure
 One type of networking (FC or IP) for the heartbeat, configuration, and data replication networks, meeting all SAN and NAS transmission requirements

9 Huawei Confidential
SAN
Diagram: Oracle RAC, VMware vSphere, and FusionSphere clusters stretched across data center A and data center B over a WAN; hosts connect to production storage over FC or IP; data is mirrored between the two storage systems over FC or IP; both systems connect over IP to a quorum server.

Working principle
One storage system is deployed in data center A and the other in data center B. Both storage systems deliver read and write services in active-active mode. No data is lost if either storage system fails.

Highlights
 Active-active architecture with active-active LUNs: both data centers are accessible to hosts, data is synchronized in real time, RPO = 0, RTO ≈ 0
 Gateway-free configuration, simplifying networks, cutting down cost, and eliminating the latency caused by gateways
 Dual arbitration modes, namely static priority mode and quorum server mode, enhancing reliability
 Upgrade from a single set of equipment to active-active mode, and further expansion to the geo-redundant mode with three data centers
 Automatic repair of bad blocks across data centers
 Storage protocol optimization, reducing the number of cross-site write I/O interactions by half and accelerating overall performance

10 Huawei Confidential
NAS
Diagram: a host application cluster accesses active-active file systems (FS 1, FS 2) in data center A and data center B over an IP SAN; data is synchronized between the production storage systems in real time over FC or IP; both systems connect over an IP network to a quorum server.

Working principle
One storage system is deployed in data center A and the other in data center B. Both storage systems deliver read and write services in active-active mode. A pair of active-active file systems is available: the primary storage system provides data read and write services, and data is synchronized to the secondary storage system. An active-active switchover is executed at the granularity of tenant (vStore) pairs. No data is lost if either storage system fails.

Highlights
 Active-active, RPO = 0, RTO ≈ 0
 Gateway-free configuration, simplifying networks, cutting down cost, and eliminating the latency caused by gateways
 Dual arbitration modes, namely static priority mode and quorum server mode, enhancing reliability
 Upgrade from a single set of equipment to active-active mode, and further expansion to the geo-redundant mode with three data centers
 Storage protocol optimization, reducing the number of cross-site write I/O interactions by half and accelerating overall performance

11 Huawei Confidential
HyperMetro-based Active-Active Network
Data center A Data center B Host-to-storage network
SAN
 Support for both FC and IP networks
Host application  Full-interconnection network (recommended)
cluster  Same type (IP or FC) of networks from a server to two storage systems
 Dual-switch network required
NAS
 Support for an IP network
 Full-interconnection network (recommended)
 Dual-switch network required

(Diagram: each host connects to both production storage systems over an IP or FC host-to-storage network.)

HyperMetro replication network
 Support for both FC and IP networks
 At least two redundant links for each controller on active-active storage systems
 Recommended distance between two storage systems < 100 km, up to 300 km
 Network and bandwidth requirements
 Bit error rate ≤ 10^-12
 Recommended latency ≤ 1 ms, maximum latency < 10 ms
 No jitter, no packet loss
 Link bandwidth ≥ peak service bandwidth, at least 2 Gbit/s
Production storage Production storage
Quorum network
IP IP  Support for an IP network
 Network and bandwidth requirements
 RTT ≤ 50 ms
Quorum server  Link bandwidth ≥ 10 Mbit/s

12 Huawei Confidential
High Availability Design

Gateway-free configuration Cross-site repair of bad blocks Dual arbitration modes


Diagrams: a cross-site active-active cluster in which storage systems A and B present active-active LUNs to the host without gateways; a read I/O that hits a bad block on LUN A being served and repaired from LUN B (steps 1-6); and arbitration by a quorum server or VM over the active-active links, with the preferred site surviving first in static priority mode.

There is no need to use extra gateway If bad blocks cannot be repaired in a storage Quorum server and static priority modes
devices, reducing fault points, simplifying system, data is read from the other storage are provided, and automatic switchover
networks, and delivering higher reliability. system to repair the bad blocks. This process is supported between the two modes. If
does not affect accesses to services. the quorum server is faulty, the static
EMC Unity series needs gateways. priority mode ensures service continuity.

13 Huawei Confidential
High Performance Design

Minimized gateway latency FastWrite Optimistic lock

Minimized gateway latency: the gateway-free design eliminates gateway bottlenecks, shortens the I/O path, and removes the 1 to 1.5 ms of latency that gateways would introduce.

FastWrite: write commands and data transfer are combined, reducing the latency of cross-site write I/O interactions by half.

Optimistic lock: because more than 99% of host I/Os have no concurrent write conflicts, locks are granted locally, reducing interactions between storage systems.

14 Huawei Confidential
Smooth Expansion to Three Data Centers
Diagram: data center A and data center B (active-active via HyperMetro) and remote disaster recovery center C, each with a DR management server on a DR management network; optional DR servers and application servers; Fibre Channel switches in each site connected over a WAN; OceanStor storage systems in all three sites, managed through BCManager Server/Agent. Legend: IP management network, IP service network, remote disaster recovery network, Fibre Channel network, HyperMetro, asynchronous replication, data flow.

Highlights
 Support for disaster recovery of SAN and NAS, as well as Dorado devices, ensuring database and file consistency
 Communication among entry-level, mid-range, and high-end storage systems and between all-flash and non-flash devices, cutting down the investment in the disaster recovery centers
 GUI-based disaster recovery management and one-click DR drill and recovery
 HyperMetro for SAN scalable to three data centers

 BCManager-based disaster recovery network
   The BCManager eReplication management server must interconnect with the storage, Oracle server, VMware vCenter, and FusionSphere VRM management networks.
   The bandwidth must be ≥ 2 Mbit/s.
 Asynchronous remote replication network
   Both FC and IP networks are supported.
   The RTT must be ≤ 100 ms.
   The bandwidth must be ≥ 10 Mbit/s (changed data volume in a service period/replication period).

Unified disaster recovery management of three data centers
Huawei disaster recovery management software, BCManager, is deployed in data center B and remote center C for unified management of HyperMetro and asynchronous remote replication. The software graphically shows physical topologies and logical service topologies of the three data centers and supports one-click test and disaster recovery in remote center C.

15 Huawei Confidential
Best Practice of Applying the Solution to Oracle RAC Applications
Diagram: X bureau and X institute access services hosted across two data centers; Oracle RAC nodes 1 and 2 run in one data center and node 3 in the other (2 + 1 deployment).

Service Name | INSTANCE1 | INSTANCE2 | INSTANCE3
SERVICE1 | PREFERRED | PREFERRED | AVAILABLE
SERVICE2 | AVAILABLE | AVAILABLE | PREFERRED

Service distribution design
• 2 + 1 cluster deployment
 Applicable to data centers where service distribution is uneven
 Applicable to two data centers with priorities
• Oracle RAC arbitration principles
 The sub-cluster with the largest number of nodes wins.
 The sub-cluster with the lowest node number wins if the numbers of nodes in the sub-clusters are the same.

Access isolation to reduce cache convergence
• You are advised to create different services at the Oracle RAC layer to separate services and prevent data interactions across data centers.
• The preferred function of Oracle RAC transparent application failover (TAF) is used to make applications access local instances only and set instances in the remote data center to available, so that access requests are switched to remote instances only when all local instances are faulty.

Best practice (see the illustrative calculation below)
1. Store the binary files and home directories of Oracle Clusterware and the Oracle database on a local computer for periodic upgrades.
2. Assign 60% to 80% of system memory capacity to databases. For an OLTP database, assign 80% of that memory to the system global area (SGA) and 20% to the program global area (PGA). For an OLAP database, assign 50% to the SGA and 50% to the PGA.
3. If hyper-threading is enabled, you are advised to set parallel_threads_per_cpu to 1.
4. You are advised to set PARALLEL_MAX_SERVERS to Min(2*parallel_threads_per_cpu*#cores, #disks).
5. You are advised to use the fast_start_mttr_target parameter to control the recovery time (for example, 300 seconds).
6. To minimize the performance impact of "checkpoint not complete" or frequent log switchovers, you are advised to create three redo log groups for each thread, and the size of a redo log should allow a log switchover roughly every 15 to 30 minutes.

You are advised not to deploy Oracle RAC in a virtualization environment (advice from Oracle).
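The following small Python helper is only an illustrative restatement of rules 2 to 4 above; the input values are examples, not recommendations beyond what the slide states.

def memory_split(total_gb: float, workload: str = "OLTP") -> dict:
    db_gb = total_gb * 0.7                              # 60%-80% of system memory to the DB
    if workload.upper() == "OLTP":
        return {"SGA": db_gb * 0.8, "PGA": db_gb * 0.2}
    return {"SGA": db_gb * 0.5, "PGA": db_gb * 0.5}     # OLAP split

def parallel_max_servers(cores: int, disks: int, threads_per_cpu: int = 1) -> int:
    # Min(2*parallel_threads_per_cpu*#cores, #disks)
    return min(2 * threads_per_cpu * cores, disks)

print(memory_split(512, "OLTP"))                        # example: 512 GB host memory
print(parallel_max_servers(cores=64, disks=48))         # -> 48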

16 Huawei Confidential
Best Practice of Applying the Solution to VMware Applications

(Diagrams: VMs (APP/OS) on ESXi hosts at both sites, connected by management and heartbeat networks and by storage paths to the active-active storage.)

APD (All Paths Down)
 Trigger condition: all links between ESXi hosts and storage systems fail, but ESXi heartbeats work properly.
 Symptom: VMs on the ESXi host are suspended and cannot automatically recover.
 Huawei solution: detect the paths and use a timeout mechanism to notify the ESXi host to perform automatic high-availability recovery of the VMs. This is the only solution that can resolve the problem.

PDL (Permanent Device Loss)
 Trigger condition: links between storage systems fail and arbitration is implemented, but ESXi heartbeats work properly.
 Symptom: VMs on the ESXi host are suspended and cannot automatically recover.
 Huawei solution: enable the ESXi host to identify the PDL status and implement automatic high-availability recovery of the VMs.

17 Huawei Confidential
Contents

1 Disaster Recovery Overview

Huawei HyperMetro-based Active-Active Data


2 Centers Disaster Recovery Solution

3 Competitive Analysis

4 Key Technologies

18 Huawei Confidential
FastWrite — Higher Dual-Write Performance

Common solution FastWrite


Host Huawei storage Huawei storage Host Host Huawei storage Huawei storage Host
100 km
100 km

1. Write command FC or IP FC or IP
1. Write command
2. Ready 2. Ready
3. Data transfer 3. Data transfer

RTT-1
RTT-1

RTT-2

8. Status Good

Site A Site B Site A Site B

 Common solution: A write I/O involves two  FastWrite: The protocol is optimized to combine write
interactions between two storage systems, namely, command and data transfer into one transmission.
write command and data transfer. The number of cross-site write I/O interactions is
 One 100-km transmission involves two RTTs. reduced by half.
 One 100-km transmission link involves only one RTT.
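A back-of-envelope calculation of the saving, with assumed numbers (roughly 200,000 km/s propagation in fiber over a 100 km link):

DISTANCE_KM = 100
ONE_WAY_MS = DISTANCE_KM / 200_000 * 1000       # ~0.5 ms one way
RTT_MS = 2 * ONE_WAY_MS                         # ~1.0 ms round trip

common_ms = 2 * RTT_MS      # write command/ready, then data transfer/status: 2 RTTs
fastwrite_ms = 1 * RTT_MS   # command and data combined: 1 RTT

print(f"common: {common_ms:.1f} ms, FastWrite: {fastwrite_ms:.1f} ms per cross-site write")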

19 Huawei Confidential
Optimistic Lock Optimization (Write Process)
Latency = t1 + t2 + t3 Latency = t1 + t3

Host cluster Host cluster

Write I/O Write I/O


t1 Cross-site active-active cluster t1 Cross-site active-active cluster
Applying distributed lock

Storage A t2 Storage B Storage A Storage B

HyperMetro LUN Apply Apply


HyperMetro LUN
local lock local lock

t3 t3

Member Member Member Member


LUN LUN LUN LUN
Preferred site Non-preferred site Preferred site Non-preferred site

Write process with distributed lock Write process with optimistic lock
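A hedged sketch of the optimistic-lock idea only (not Huawei's implementation; the data structures and conflict handling are simplified): the range lock is granted locally, and the cross-site distributed lock is used only when a conflicting write to the same range is detected.

local_locks = set()   # address ranges currently locked on this array

def optimistic_write(address_range, dual_write, cross_site_lock):
    """dual_write/cross_site_lock are callables standing in for the real I/O paths."""
    if address_range not in local_locks:       # >99% of host I/Os: no concurrent conflict
        local_locks.add(address_range)         # lock granted locally, no extra round trip
    else:
        cross_site_lock(address_range)         # rare conflict: negotiate with the peer array
    try:
        dual_write(address_range)              # mirror the write to both member LUNs
    finally:
        local_locks.discard(address_range)

optimistic_write(("LUN1", 0, 8192), dual_write=lambda r: None, cross_site_lock=lambda r: None)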

20 Huawei Confidential
HyperMetro Arbitration Design

Arbitration design
• Quorum servers are deployed at a third-place site that is in a
Storage resource pool
different fault domain from the two active-active data centers.
Two quorum servers are supported to prevent single points of
failures.
Preferred site
• The failure of a quorum server does not affect active-active
services, and the arbitration mode automatically switches to
Storage system A Storage system B the static priority mode.
IP
Note: Two quorum servers work in active-standby mode. Only
one quorum server is in effect at a point in time.
• If there is no quorum server, the arbitration mode is
Standby quorum server
Third-place site Active quorum server configured to static priority mode.
Note: The failure of the preferred site interrupts services.
• Quorum device: Physical servers or virtual servers can be used as
quorum devices. Two quorum servers can be deployed. • Compared with the static priority mode, the quorum server
• Quorum link: IP addresses must be reachable. mode delivers higher reliability that ensures service continuity
• Arbitration mode: Both quorum server mode and static priority mode are in the event of a single point of failure. Therefore, the quorum
offered. server mode is recommended.
• Arbitration granularity: Arbitration is performed based on LUN pairs or
consistency groups for SAN as well as vStores for NAS.
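The arbitration decision can be summarized by the following sketch (behavior as described above; the function and site names are assumptions):

from typing import Optional

def arbitrate(quorum_reachable: bool, quorum_vote: Optional[str], preferred_site: str) -> str:
    """Return which site continues serving I/O after a replication-link failure."""
    if quorum_reachable and quorum_vote is not None:
        return quorum_vote        # quorum server mode: the site that wins arbitration
    return preferred_site         # automatic fallback to static priority mode

print(arbitrate(True, "site_B", preferred_site="site_A"))   # -> site_B
print(arbitrate(False, None, preferred_site="site_A"))      # -> site_A (preferred site survives)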
21 Huawei Confidential
Cross-Site Bad Block Repair
Working principle

1. The production host reads data from storage A.


Host
2. Storage A detects a bad block by verification.
3. The bad block fails to be repaired by
Read I/Os
reconstruction. (If the bad block is repaired, the
1 5 HyperMetro
following steps will not be executed.)
Storage A Storage B 4. Storage A checks the status of storage B and
Active-active LUNs
initiates a request to read data from storage B.
2 4
3 6 5. Data is read successfully and returned to the
Bad block production host.
HyperMetro HyperMetro
member LUN member LUN
6. The data of storage B is used to repair the bad
block's data.

The cross-site bad block repair technology is Huawei's


patented technology. It can be automatically executed.
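The numbered steps map onto a simple read path, sketched below for illustration (object and method names are assumptions, not a product API):

def read_with_cross_site_repair(lba, local, peer):
    """local/peer expose read(lba) -> data or None, and write(lba, data)."""
    data = local.read(lba)            # steps 1-2: read and verify on storage A
    if data is not None:
        return data
    # step 3: local reconstruction failed, the block is bad
    data = peer.read(lba)             # steps 4-5: check storage B and read the block there
    if data is None:
        raise IOError("block unreadable on both sites")
    local.write(lba, data)            # step 6: use storage B's data to repair storage A
    return data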

22 Huawei Confidential
Summary

 Gateway-free: active-active SAN + NAS over IP or FC, simple.
 Flexible: SAN + NAS disaster recovery with smooth expansion to three data centers.
 Data speaks louder: POC statistics in Russia show a POC success rate of up to 95% (pie chart shares: 65%, 15%, 15%, 5%), covering Dorado V3 and unified storage; Dorado V6 from 2020 H1.

23 Huawei Confidential
Bring digital to every person, home, and organization
for a fully connected, intelligent world.

Copyright©2020 Huawei Technologies Co., Ltd.


All Rights Reserved.

Thank You

The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive statements. Therefore, such information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time without notice.
Huawei Storage Portfolio

Solutions: FusionData, DR, Backup, Archive

Management: DeviceManager (device management), OceanStor DJ (intelligent storage management), eService (intelligent O&M)

Intelligent Data Storage
 All-Flash Storage: OceanStor Dorado 18000 V6, 8000 V6, 6000 V6, 5000 V6, and 3000 V6; OceanStor Dorado
 Hybrid Flash Storage: OceanStor 18500/18800 V5, OceanStor 6800 V5, OceanStor 5300/5500/5600/5800 V5, OceanStor 2200/2600 V3
 Distributed Storage: OceanStor 100D, OceanStor 9000

1
Huawei Converged Storage
Architecture and Technical Overview
1 Introduction to Converged Storage Products

2 Converged Storage Architecture

3 Software and Features

3
Huawei Converged Storage Overview

Entry-Level Mid-Range High-End


OceanStor 6800 V5 OceanStor 18500/18800 V5

OceanStor
OceanStor 5800 V5
OceanStor
OceanStor 5600 V5
OceanStor 5500 V5
2200/2600 V3 5300 V5

Entry-Level Mid-Range High-End


Controller enclosure height 2U 2U 4U
Controller expansion 2 to 8 2 to 16 2 to 32
Max number of disks 300 to 500 1200 to 2400 3200 to 9600
Cache specifications 16 GB to 256 GB 128 GB to 12 TB 512 GB to 32 TB

4
Multi-Level Convergence Makes Core Services More Agile

(Diagram: HyperMetro between Dorado and V5/V3 arrays; NAS+SAN, SAN, and NAS resource pools.)

 Multiple storage types: interconnection between different types, levels, and generations of flash storage.
 SAN and NAS: support for multiple types of services with industry-leading performance and functions.
 SSD and HDD: HDDs and SSDs converged to meet the performance requirements of complex services.
 Multiple storage resource pools: pooling of heterogeneous storage resources with unified management and automated service orchestration.
 A-A for SAN and NAS: gateway-free converged data DR solution with smooth upgrade to 3DC.

Multi-level convergence
99.9999% service availability, satisfying complex service requirements

5
1 Introduction to Converged Storage Products

2 Converged Storage Architecture

3 Software and Features

6
OceanStor Converged Storage V5 Architecture: Comparison
Converged and parallel file and block services architecture (Huawei)
• iSCSI/FC and NFS/CIFS/FTP/HTTP protocols
• Converged block and file services, processed in parallel on a storage pool based on RAID 2.0+

SAN over NAS file system architecture
• FC/iSCSI and NFS/CIFS/HTTP/FTP protocols on top of a WAFL file system with RAID DP / RAID 4+
• Converged block and file storage with a unified file & block manager and physical RAID groups

Standalone NAS gateway architecture
• File services provided by a standalone NAS gateway (with its own virtualization and RAID manager layers) in front of SAN storage subsystems
• NAS storage pool consisting of one or more LUNs mapped from SAN storage

7
OceanStor V5 Software Architecture Overview

iSCSI/FC NFS/CIFS Operation Unified: NAS and SAN


System management software stacks are
control
Smart series: SmartThin, SmartDedupe, Deploy parallel. File systems are
Data service
SmartCompression, SmartQoS, SmartPartition, in redirect-on-write (ROW)
Cluster SmartTier, and SmartQuota Expand
service mode and LUNs are in
Hyper series: HyperSnap, HyperReplication, HyperVault, Upgrade
HyperClone, HyperLock, and HyperMetro
copy-on-write (COW)
License mode, adaptive to different
Resource
scenarios.
object Monitor
Cache
management Block service File service
service Alert Converged: NAS and SAN
Log resources are converged in
System object allocation on the
management Dashboard
Storage pool RAID 2.0+ management plane. SAN
Topology
and NAS resources are
Heterogeneous directly allocated from the
Kernel and Parallel Memory I/O
driver computing management scheduler Protection storage pool. All resources
Device object
management (disks, CPUs, and memory)
Recovery
are shared, fully utilizing
SSD SAS NL-SAS Backup resources.

8
Convergence of SAN and NAS: How to Implement
Diagram: from disk domain to protocol.
 Disk domain: SSD, SAS, and NL-SAS disks form tiers (Tier0/Tier1/Tier2).
 Storage pool: CKGs (RAID groups) are divided into extents (tiering units) and grains (thin provisioning, deduplication and compression), with block-level tiering and a level-2 (stripe) cache.
 LUN & FS: thick LUNs (tiered) and thin LUNs (tiered or not tiered) are built from extents or grains; file systems are built from grains divided directly from CKGs, with file-level tiering; a level-1 cache serves LUNs and file systems.
 Protocol: iSCSI/FC for LUNs; NFS and CIFS shares for file systems.
OceanStor converged storage architecture minimizes the I/O paths of SAN and NAS, and provides optimal performance.

9
OceanStor V5: Reliable Scale-out Storage

Data reliability Device reliability Solution reliability

Reliable RAID technology Reliable cluster system Reliable solution

CKG CKG

App App

Active-Active

SSD HDD

RAID 2.0+ architecture SmartMatrix 3.0 architecture Active-active architecture

RAID 2.0+ SmartMatrix 3.0 HyperMetro


Fastest reconstruction Tolerance of 3-controller failure RPO = 0, RTO ≈ 0

10
Key Technology: RAID 2.0+ Architecture

Hot Hot
spare spare

Traditional LUN RAID 2.0+ block


RAID virtualization virtualization
EMC: VMAX Huawei: RAID 2.0+
HDS: VSP HP 3PAR: Fast RAID
NetApp: FAS

20-fold faster data reconstruction


 Huawei RAID 2.0+: bottom-layer media virtualization + upper-layer resource virtualization for fast data reconstruction and intelligent resource allocation
 Fast reconstruction: Data reconstruction time is shortened from 10 hours to only 30 minutes. The data reconstruction speed is improved 20-fold. Adverse service impacts and disk
failure rates are reduced.
 All disks in a storage pool participate in reconstruction, and only service data is reconstructed. The traditional RAID's many-to-one reconstruction mode is transformed to the many-to-
many fast reconstruction mode.

11
More Reliable System with 20-Fold Faster Data Reconstruction

Reconstruction principle of Reconstruction principle With RAID 2.0+, data reconstruction time
traditional RAID of RAID 2.0+ plummets from 10 hours to 30 minutes

Chart: time for reconstructing a 1 TB NL-SAS disk — about 10 hours with traditional technology vs. 30 minutes with Huawei's quick recovery technology.

 Traditional RAID: many-to-one reconstruction to a hot spare disk, long reconstruction time. The reconstruction speed of a traditional RAID group is 30 MB/s, so it takes about 10 hours to reconstruct 1 TB of data.
 RAID 2.0+: many-to-many parallel reconstruction, short reconstruction time. According to Huawei's test results for reconstructing 1 TB of data, RAID 2.0+ shortens the time from 10 hours to 30 minutes.
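The quoted figures can be reproduced approximately with a quick calculation (the 20-disk pool size is an assumption made for illustration):

TB_MB = 1_000_000                 # 1 TB expressed in MB
SPEED_PER_DISK = 30               # MB/s rebuild speed per disk

traditional_h = TB_MB / SPEED_PER_DISK / 3600                  # ~9.3 h, i.e. "about 10 hours"
disks_in_pool = 20                                             # assumed RAID 2.0+ pool size
raid20_min = TB_MB / (SPEED_PER_DISK * disks_in_pool) / 60     # many-to-many rebuild

print(f"traditional: ~{traditional_h:.1f} h, RAID 2.0+ with 20 disks: ~{raid20_min:.0f} min")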

12
SmartMatrix 3.0 Overview

Fully interconnected and shared


architecture for high-end storage
 Four-controller interconnection: Front-end Fibre
Channel interface modules, back-end interface
modules, switch cards, and controllers are fully
interconnected. Front-end and back-end I/Os
are not forwarded.

 Single-link upgrade: When a host connects to a


single controller and the controller is upgraded,
interface modules automatically forward I/Os to
other controllers without affecting the host.

 Non-disruptive reset: When a controller is reset


or faulty, interface modules automatically
forward I/Os to other controllers without
affecting the host.

 New-gen power protection technology:


Controllers house BBUs. When a controller is
removed, its BBU provides power for flushing
cache data to system disks. Even when multiple
controllers are concurrently removed, data is not
lost.

Note: The iSCSI and NAS protocols do not


OceanStor 6800/18500/18800 V5 support front-end interconnect I/O modules.

13
Key Technology: SmartMatrix 3.0 Front-End Interconnect I/O Module (FIM)

 External FC links are established


between hosts and FIMs. Each FC port
on an FIM connects to four controllers
using independent PCIe physical links,
allowing a host to access the four
controllers via any port.
 An FIM intelligently identifies host I/Os.
Host I/Os are sent directly to the most
appropriate controller without
pretreatment of the controllers,
preventing forwarding between
controllers.
 In the figure, if controller 1 is faulty,
services on controller 1 are switched
over to other controllers within 1s. At the
same time, the FIM detects that the link
to controller 1 is disconnected and
redistributes host I/Os to other
functioning controllers by using the
intelligent algorithm. The entire process
is completed quickly and transparent to
hosts without interrupting FC links
between FIMs and hosts.

14
Key Technology: SmartMatrix 3.0 Persistent Cache

Diagram: cache data A/B/C/D and mirror copies A*/B*/C*/D* distributed across controllers A-D in the normal state, after the failure of one controller (such as controller A), and after the failure of more controllers (such as controllers A and D).

 If controller A fails, controller B takes over its cache, and the cache of controller B (including that of controller A) is
mirrored to controller C or D. If controller D is also faulty, controller B or C mirrors its cache.
 If a controller fails, services are switched rapidly to the mirror controller, and the mirror relationship between it and
other controllers is re-established, so each controller has a cache mirror. In this way, write-back (instead of write-
through) for service requests is ensured. This guarantees the performance after the controller failure and ensures
system reliability, because data written into the cache has mirror redundancy.
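A simplified sketch of the takeover idea (the initial mirror layout and the re-mirroring rule below are assumptions; the product mirrors cache per LUN and per slice, as shown on the next slide):

mirrors = {"A": "B", "B": "C", "C": "D", "D": "A"}     # assumed initial mirror relationships

def handle_controller_failure(failed: str, mirrors: dict) -> dict:
    survivors = [c for c in mirrors if c != failed]
    print(f"{mirrors[failed]} takes over the cache of {failed}")
    # re-establish mirrors among the survivors so write-back caching can continue
    return {c: survivors[(i + 1) % len(survivors)] for i, c in enumerate(survivors)}

mirrors = handle_controller_failure("A", mirrors)      # -> {'B': 'C', 'C': 'D', 'D': 'B'}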

15
Key Technology: Load Balancing of SmartMatrix 3.0 Persistent Cache
Diagram: the work cache of each controller (A1-A3, B1-B3, C1-C3, D1-D3) is mirrored evenly across the other three controllers; after controller A fails, its cache slices and their mirror copies are redistributed among controllers B, C, and D.

Load Balancing of SmartMatrix 3.0 Persistent Cache


 Load balancing among four controllers: The cache of each controller's LUN is evenly mirrored to the cache of the other
three controllers. The cache is persistently mirrored.
 Four-controller balanced takeover: If one controller is faulty, all LUNs on the faulty controller are evenly taken over by the
other three controllers.
 Switchover of four owning controllers: The owning controller of LUNs and file systems can be switched to any of the
other three controllers in seconds.

16
Key Technology: HyperMetro (Block & File)

Site A Site B
Working Principles
 One device
Host application
Gateway-free, one device uses HyperMetro to support both active-active files
cluster (shared
volumes and databases.
mounted to  One quorum system
HyperMetro file SAN and NAS share one quorum site. Services are provided by the same site
systems) in the event of link failures, ensuring data consistency.
 One network
Storage network
between storage
The heartbeat, configuration, and physical links between two sites are
arrays and hosts integrated into one link. One network supports both SAN and NAS
IP&FC IP&FC transmission.
SAN
Real-time data
NAS SAN mirroring SAN NAS
Dual-write heartbeats
and configurations
Highlights
 Active-Active, RPO = 0, RTO ≈ 0
 Requires no gateway devices, simplifying networks, saving costs, and
FC/IP eliminating gateway-caused latency.
Production storage Production storage  Supports two quorum servers, improving reliability
 Supports flexible combination of high-end, mid-range, and entry-level
IP IP storage arrays for active-active solutions, saving investment.
 Supports smooth upgrade from active-active or active-passive solutions to
3DC solutions without service interruption.
 Flexibly supports 10GE or FC networks for intra-city interconnection and IP
networks for quorum links.
Quorum site

17
1 Introduction to Converged Storage Products

2 Converged Storage Architecture

3 Software and Features

18
OceanStor Converged Storage Features

SAN Features

iSCSI FC

SmartThin SmartQoS SmartPartition SmartCache


Intelligent thin provisioning Intelligent service quality control Intelligent cache partitioning Intelligent SSD cache

SmartDedupe SmartCompression SmartMulti-Tenant SmartTier


Intelligent inline deduplication Intelligent inline compression vStore (tenant) administration Intelligent block tiering

HyperSnap HyperReplication HyperMirror HyperCopy


Snapshot Synchronous or asynchronous Local active-active LUNs LUN copy
remote replication

HyperClone HyperMetro VVol Support Encryption/Key Mgmt.


LUN clone Synchronous mirroring for Support for VMware VVols Internal Key Manager
automatic failover Self-encrypting drive (SED)

19
OceanStor Converged Storage Features
NAS Features

CIFS (v1/v2/v3) NFS (v3/v4) HTTP/FTP/NDMP

SmartThin SmartQoS SmartPartition SmartCache


Intelligent thin Intelligent service Intelligent cache Intelligent SSD cache
provisioning quality control partitioning
SmartDedupe&Compression SmartQuota SmartMulti-Tenant SmartTier
Intelligent inline deduplication & compression Quota management vStore (tenant) administration Intelligent file tiering

HyperSnap HyperReplication HyperVault HyperMetro HyperClone


Snapshot Internal and external Inter-array backup A-P cluster FS virtual clone
asynchronous replication
GNS Internal DNS Service HyperLock
Global namespace Auto-load balance for NAS WORM
connections

20
SmartMulti-Tenant Architecture: Network, Protocol, and Resource Virtualization

Protocol virtualization
vStore vStore
AD Client
Client
Client
Client
Client LDAP/NIS
Client

LIF0 LIF1 LIF2 LIF3


Transmission (TCP/IP)
Authentication server User/Group manager
User & AD & LADP LADP
Group Share Share &NIS Share
DNS & NIS
DB Share AD & DNS Share LADP & NIS

CIFS service NFS service NFS service


CIFS service NFS service

Lock manager
FS 0 FS 1 FS 2 FS 3 VFS

Storage pool A Storage pool B

 With protocol virtualization, each vStore has separate NAS service instances. Each vStore can configure NAS services independently, isolate I/O requests, and use different AD/LDAP/NIS services from other vStores.

21
HyperMetro for NAS

Site A Site B

Working Principles
FS FS FS FS High-availability synchronous mirror at a file-system level: When data is
written to the primary file system, it will be synchronously replicated to the
secondary file system. If the primary site or file system fails, the secondary site
HyperMetro vStore pairs:
(vStore1  vStrore1') or file system will automatically take over services without any data loss or
IP/FC IP/FC
(vStore2  vStrore2') application disruption.

HyperMetro pairs:
FS1 FS1' (FS1  FS1')
vStore1 vStore1' (FS2  FS2')
FS2
Data and
FS2'
(FS3  FS3') Highlights
FS3 configuration FS3' (FS4  FS4')
vStore2 sync
vStore2'
 Gateway-free deployment
FS4 FS4'  1 network type between sites
FS5 FS5'  2 components required for smooth upgrade
vStore3 vStore3'
FS6 FS6'  3 automatic fault recovery scenarios
 4x scalability
 5x switching speed
IP IP

Quorum server

22
HyperMetro Architecture
LDAP, AD, and NIS servers
Client

vStore A at the primary site vStore A' at the secondary site

LIF0 LIF1 LIF2 LIF0 LIF1 LIF2 Network

User & AD & LADP User & AD & LADP


Share Share Share Share
Group DNS & NIS Group DNS & NIS
Protocol instance

CIFS NFS CIFS NFS

File system
FS 0 FS 1 FS 2 FS' 0 FS' 1 FS' 2

Storage pool A Storage pool B


• HyperMetro enables synchronous mirroring of access networks, protocol instances, and file systems in a vStore, ensuring a seamless service failover to the secondary
site when the primary site fails.
• vStore A' on the secondary site is in the passive state. File systems in vStore A' cannot be accessed. After the failover, vStore A' is in the active state, and LIF, protocol
instances, and file systems are enabled. Service data, protocol configuration data, and network configurations will be identical, and the client can recover services by
retrying requests.

23
Service and Access Processes in the Normal State
1-7: Configurations made by the administrator on
NAS client Admin
vStore A are synchronized to vStore A' in real
10.10.10.1 time, such as the quota, qtree, NFS service,
11 18 CIFS service, security strategy, user and user
1 7
10.10.10.10 10 8 10.10.10.10
group, user mapping, share, share permission,
DNS, AD domain, LDAP, and NIS.
2
NAS service CFG sync
4
CFG sync NAS service If a failure occurs, the changed configurations
12 17 2 3 5 6 are saved in the CCDB log and the vStore pair
status is set to "to be synchronized". After the
File system CCDB CCDB File system
link is recovered, the configurations in the CCDB
13 16 vStore A vStore A'
16
14 14 14 log are automatically synchronized to vStore A'.
Object set Data sync Data sync Object set

14 15
15 15 15 14 15 8-18: When a NAS share is mounted to the
Concurrent write
9
Cache Cache client, the storage system obtains the access
permission of the share path based on the client
IP address. If the network group or host name
Storage pool Storage pool
has the permission, the client obtains the handle
of the shared directory. If a user writes data into
a file, the NAS service processes the request
Storage system A Storage system B
and converts it into a read/write request of the
file system. If it is a write request, the data
LADP/NIS
synchronization module writes the data to the
server caches of both sites simultaneously, and then
returns the execution result to the client.

24
Service and Access Processes During a Failover

Admin NAS client

10.10.10.1
1 8
10
10.10.10.10 10.10.10.10 1-8: When vStore A is faulty, vStore A'
13
detects the faulty pair status and
NAS service CFG sync CFG sync NAS service applies for arbitration from the quorum
11 12 2 7 server. After obtaining the arbitration,
File system CCDB CCDB File system vStore A' activates the file system, NAS
vStore A vStore A' 3 6 service, and LIF status. NAS service
Object set
14 Data
CCDB log Object set configuration differences are recorded
synchronization 4 5 in the CCDB log, and data differences
Cache DCL
9 Cache are recorded in the data change log
9 (DCL). In this manner, vStore A can
Storage pool Storage pool synchronize incremental configurations
and data upon recovery.
Storage system A Storage system B
The CCDB log and DCL are configured
with power failure protection and have
high performance.
LADP/NIS server

25
NFS Lock Failover Process

Synchronize a client's IP address pair Notify the client The client reclaims the lock

Mount1: Mount1:
10.10.10.11:/fs1 Mount1:
10.10.10.11:/fs1
10.10.10.11:/fs1
10.10.10.1 10.10.10.1
10.10.10.1
Notify Reclaim
10.10.10.11 10.10.10.11
10.10.10.11 (inactive) 10.10.10.11 10.10.10.11
(inactive) 10.10.10.11

NAS NAS NAS NAS NAS NAS


service service service service service service

Back up
Configuration client info Configuration Read
synchronization synchronization configuration

CCDB
CCDB CCDB

vStore A vStore B vStore A vStore B vStore A vStore B

1. HyperMetro backs up a 1. The NAS storage reads the list 1. The client sends a lock
client's IP address pair to of IP address pairs from the reclaiming command to the
remote storage. CCDB. storage.
2. The NAS storage sends 2. The storage recovers byte-
NOTIFY packages to all clients range locks.
to reclaim locks.

26
NAS HyperMetro: FastWrite

General Solution FastWrite


OceanStor V5 OceanStor V5 OceanStor V5 OceanStor V5
Host storage storage Host storage
Host storage Host
100 km
100 km

1. Write command FC or 10GE FC or 10GE


1. Write command
2. Transfer ready 2. Transfer ready
3 Data transfer 3 Data transfer

5. Transfer ready 5. Status good RTT-1


RTT-1

RTT-2
8. Status good

Site A Site B Site A Site B

 FastWrite: A proprietary protocol is used to combine the two


 General solution: Write I/Os undergo two interactions at two interactions (write command and data transfer). The cross-site write
sites (write command and data transfer). I/O interactions are reduced by 50%.
 100 km transfer link: two round trip time (RTT) delays  100 km transfer link: RTT for only once, improving service
performance by 30%

27
HyperSnap
Copy-on-write (COW), used by LUNs of OceanStor V5 storage
 Before snapshot creation, the active volume holds blocks A, B, C, D. When the snapshot is created, only a snapshot mapping table is set up; no data is copied. When the host later modifies block D, the original block D is first copied to the snapshot space and the mapping table is updated; the new data D1 then overwrites the original location in the active volume.

Redirect-on-write (ROW), used by file systems of OceanStor V5 storage
 Before snapshot creation, the active file system holds blocks A, B, C, D. When the snapshot is created, it simply references the existing blocks. Modified data (D1) and new data (E1, E2) are written to new locations and the active file system is redirected to them; deleted and overwritten blocks remain referenced by the snapshot, so no copy is needed before a write.

28
HyperVault
2.1 Working Principles (1)
 The initial backup of HyperVault is
a full backup, and subsequent
backups are incremental backups.
 Because HyperVault operates on file systems, backups are completely transparent to hosts and applications.
 Each copy at the backup file
system contains full service data,
not only the incremental data.
 The data at the backup file system
is stored in the original format and
is readable after the backup is
complete.

29
HyperVault

Policy-based automatic backup

Backup policy: local backup (up to four policies) and


remote backup (up to four policies)
Number of backups supported by each policy: 3
to 256
Backup period: monthly, weekly, or daily.

• Incremental backup and restoration


• Earliest backup deleted automatically
based on a backup policy
• Both local backup and remote backup
supported for restoration.
• Interoperability among high–end, mid-
range, and entry-level storage arrays

30
DR Star (SAN)
I/O process:
DC 1
1. The host delivers I/Os to the primary LUN-A.
LUN-A (Ta) 2. The primary site dual-writes the I/Os to the secondary LUN-B.
Asynchronous replication (standby)
3. A write success is returned to the host.
4 4. Asynchronous replication starts and triggers LUN-A to activate the time slice
1 Ta+1. New data written to LUN-A is stored in this time slice, and the Ta slice is
LUN-A (Ta+1) DC 3 used as the data source for the standby asynchronous replication.
3
5. LUN-B activates a new time slice Tb+1, and the new data is stored in this time
LUN-C (Tc+1) slice. LUN-C activates a new time slice Tc+1 as the target of asynchronous
replication. Tc is the protection point of asynchronous replication rollback.
6. LUN-B (Tb) is the data source for asynchronous replication to LUN-C (Tc+1).

Data in DC1 and DC2 is synchronous. After the data is copied from Tb to Tc+1, the data in Ta is also copied to Tc+1. This process is equivalent to asynchronous replication between DC1 and DC3. If DC2 is faulty, DC1 and DC3 are switched to asynchronous replication, and incremental data is replicated from Ta to DC3.

Item | Huawei | H** | E**
Active-active + asynchronous remote replication | Supported | Supported | Not supported
Synchronous remote replication + asynchronous remote replication | Supported | Not supported | Supported
Configured at one site | Supported | Not supported | Supported

Compared with the common 3DC solution:
1. There is a replication relationship between every two sites. Only one of the two asynchronous replication relationships carries I/O replication services; the other one is in the standby state.
2. If the working asynchronous replication link is faulty or one of the active-active sites is switched over, the working link is switched to the standby link. Then, incremental synchronization can be implemented.
3. You only need to configure DR Star at one site.
4. DR Star supports the active-active + asynchronous + standby and synchronous + asynchronous + standby networking modes. The asynchronous + asynchronous + standby networking mode is not supported.
asynchronous + asynchronous + standby networking mode is not supported.

31
SmartTier (Intelligent Tiering)

LUN

Extent I/O monitoring Collects statistics on the activity


levels of each extent.

Data distribution analysis Ranks the activity level of each extent.

Tier0: SSD Tier1: SAS Tier2: NL-SAS Relocates data based on the rank
Data relocation
and relocation policy.

ROOT

Dir Dir
File system
Indicates the user-defined file write policy and
relocation policy.
The supported attributes include the file
File policy
size/name/type/atime/ctime/crtime/mtime.
File
Scans the list of files to be relocated based on
File distribution analysis the file policy.

Tier0: SSD Tier1: SAS/NL-SAS File relocation Relocates files based on policy.

32
SmartTier for NAS (Intelligent File Tiering)

File system
Highlights
Tiering policy

Automatic relocation mode:


File File Files are first written into SSDs, and then dynamically
scanning scanning
relocated between SSDs and HDDs based on the SSD
Performance Capacity tier
tier (SSD) (HDD)
usage and file access frequency.
File
relocation
Customized relocation mode:
Metadata
Users are allowed to specify file policies (including the
file name, file name extension, size, creation time,
access time, and modification time) and relocation
policies (weekly, in intervals, and immediately).
Comparison between SmartTier for Block and SmartTier for File

Feature | Tier | Scope | Data Relocation Granularity | Relocation Speed | Relocation Mode
SmartTier for Block | Three tiers (SSD/SAS/NL-SAS) | One storage pool | Extent | High, medium, and low | Automatic
SmartTier for File | Two tiers (SSDs and HDDs) | One file system | File | Automatic | Automatic and customized policies

33
SmartTier File Relocation Principles

SmartTier policy

Initial file write policy


• Preferentially writes data into the
performance tier.
Performance tier (SSD) Scan the file system, obtain file • Preferentially writes data into the
attributes, and identify hot and capacity tier.
cold files based on user • Determines the tier to which the file is
configurations. written based on the file attributes.

Relocation period
• Specifies the start time.
Add files to the background • Specifies the running duration.
relocation task. • Can be Paused.

Capacity tier Relocation condition


(NL-SAS/SAS HDD) • File time (Atime/Ctime/Mtime/Crtime)
Relocate these files to the • File size
specified media. • File name
SA NL- NL- • File name extension
S SAS SAS • SSD utilization
Hot file
Policy specifications
Cold file The relocation is complete. • A maximum of 10 file policies can be
created for each file system.
• Each policy supports multiple
conditions.
• The priority of file policies can be set.
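A hedged sketch of how one such policy might be evaluated for each scanned file (the field names, thresholds, and policy shape are assumptions made for illustration, not the product's configuration model):

import time

policy = {                        # one of up to 10 user-defined policies per file system
    "min_size": 10 * 1024**2,     # relocate files larger than 10 MB ...
    "min_atime_age_days": 30,     # ... not accessed for 30 days ...
    "name_suffix": (".iso", ".bak"),
    "ssd_usage_threshold": 0.7,   # ... and only when the SSD tier is over 70% full
}

def should_relocate(f: dict, ssd_usage: float) -> bool:
    age_days = (time.time() - f["atime"]) / 86400
    return (ssd_usage > policy["ssd_usage_threshold"]
            and f["size"] >= policy["min_size"]
            and age_days >= policy["min_atime_age_days"]
            and f["name"].endswith(policy["name_suffix"]))

f = {"name": "backup_2019.bak", "size": 50 * 1024**2, "atime": time.time() - 90 * 86400}
print(should_relocate(f, ssd_usage=0.85))    # -> True: add to the background relocation task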

34
SmartTier for NAS and Background Deduplication & Compression
Configure SmartTier to improve performance and save space:
 Enable SmartTier for the file system and configure the automatic relocation mode where all data is written into the performance tier
(SSD tier).
 Set the SmartTier relocation time from 22:00 to 05:00.
 In SmartTier, enable deduplication and compression during relocation.

SmartTier policy example


0 1 2 3 4
8 A.M. to 8 P.M. 10 P.M. to 5 A.M. 5 A.M. to 8 A.M. 8 A.M. to 8 P.M.

Performance tier
(SSD)

Capacity tier SAS SAS


NL-SAS NL-SAS
(NL-SAS/SAS HDD)

Create a file system. New data is Data is deduplicated and Deduplication New data is
written to SSDs. compressed when and compression written to SSDs.
relocated to HDDs. are complete.

35
THANK YOU

Copyright © 2020 Huawei Technologies Co., Ltd. All Rights Reserved.


The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product portfolio, new
technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive statements. Therefore, such
information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time without notice.
2.0

2019/9

Storage Reliability Technologies


HUAWEI TECHNOLOGIES CO., LTD.
Outline
Contents
1. Basic concepts and metrics of reliability/availability
2. Module-level, system-level, and solution-level reliability technologies of OceanStor Dorado V6
3. O&M reliability of OceanStor Dorado V6

Objectives
Upon completion of this course, you will be able to understand OceanStor Dorado V6's key reliability
features and their technical principles.

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 2


Contents

1 Storage Reliability Metrics

2 Module-Level Reliability

3 System-Level Reliability

4 Solution-Level Reliability

5 O&M Reliability

6 Reliability Tests and Certifications

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 3


Storage Reliability Metrics
MTBF (Mean Time Between Failures)
 Failure rate λ = 1/MTBF; 1 FIT = 10^-9 (1/h)
 Return repair rate F(t) = λ x t; annual return repair rate = λ x 8760

Maintainability: MTTR (Mean Time To Recover)
 Repair rate μ = 1/MTTR

Availability (A) = MTBF / (MTBF + MTTR), expressed as 0.99999... or "x nines"

Consequence: downtime (DT), the time during which a service or function is unavailable
 DT = (1 – A) x 8760 x 60 (minutes/year)

System reliability metrics: MTBF, MTTR, and availability
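A worked example with assumed input values shows how these metrics relate:

mtbf_h = 500_000            # assumed mean time between failures, hours
mttr_h = 2                  # assumed mean time to recover, hours

failure_rate = 1 / mtbf_h                    # lambda = 1/MTBF (per hour)
fits = failure_rate / 1e-9                   # 1 FIT = 10^-9 per hour
annual_return_rate = failure_rate * 8760
availability = mtbf_h / (mtbf_h + mttr_h)
downtime_min_per_year = (1 - availability) * 8760 * 60

print(f"lambda = {failure_rate:.1e}/h ({fits:.0f} FITs)")
print(f"annual return repair rate = {annual_return_rate:.2%}")
print(f"A = {availability:.6f}, downtime ~ {downtime_min_per_year:.1f} min/year")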

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 4


Overview of Storage System Reliability
Level 3: Data reliability Service availability O&M reliability
Solution level
Data protections and HyperReplication
HyperSnap HyperClone HyperMetro Intelligent
(remote Fast upgrade
disaster recovery (DR) (snapshot) (data clone) (active-active) prediction
replication)
solutions provide
system-level data 3DC (geo-
protection and DR. redundancy)

Multiple cache Switchover Continuous High disk fault


Level 2: copies
RAID 2.0+
within seconds mirroring
Wear leveling
tolerance
System level
Reconstruction Dynamic HyperMetro- Bad block/sector Online
System-level reliability offloading reconstruction Inner Overload scanning diagnosis
design enables fault self- (high-end control
healing and data integrity I/O data models) Quick response to Isolation of
protection for a system. protection slow I/Os slow disks

Level 1:
Module level Hardware
Component Device Environment Production Disk
Lean manufacturing and reliability
processing ensure the
yield rate.

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 5


Contents
1 Storage Reliability Metrics

2 Module-Level Reliability

3 System-Level Reliability

4 Solution-Level Reliability

5 O&M Reliability

6 Reliability Tests and Certifications

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 6


Module-Level Reliability — Overview
Module System Solution
Level Level Level

Component reliability Disk reliability Data integrity Planned activity


• Optical module • Storage component • Fault forecast • Anti-vibration design • Error prevention
• T10 PI
Component

• Clock • Connector • Production- • ERT • Replacement of


component • Dedicated IC phase filtering • Heat dissipation design
• Chip ECC/CRC faulty

Media
• Soft failure components
prevention • Reliable upgrade
Reliability of Huawei- HSSD reliability • Reliable
• Protocol data
developed chips • Backup power • Load balancing algorithm integrity expansion
• Error detecting • ECC/CRC reliability • Error correction algorithm • Data storage
• Error handling • BIST • RAID • Bad block management redundancy
Running fault
• Key hardware signal • Board
detection and self-
Board

• Storage component Manufacturing healing


Board reliability • Low-speed process/material
• Clock signal
management bus • BIST reliability
• Board power supply
• Board temperature • SI/PI • Fault diagnosis
• Materials and locating
screening • Fault prediction
Redundancy design System power supply/backup System cooling • Burnin screening • Power on self-test
Device

• ORT • Environment check


power • Component check
• FRU/Network • IQC
• PDU/BBU/CBU/Power supply reliability • Fan reliability and self-healing
redundancy
• Functional module
check and self-
healing
Environment

Security • Software check


Power Anti- Anti- Moisture and self-healing
Temperature Altitude Dust proof EMC standards
supply vibration corrosion proof
compliance

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 7


Module-Level Reliability — Environment
Module System Solution
Level Level Level

Systematic Anti-vibration Solution Anti-corrosion Design Thermal Design

• Temperature monitoring: Multiple


temperature monitoring points are designed in
• Die-cast alloy disk guide rails: The shock • Anti-corrosion techniques for disks: the system to for real-time monitoring.
resistance quality of the die-cast zinc base Electroless Nickel/Immersion Gold (ENIG) and • Intelligent fan speed adjustment: When
alloy guide rails reduces the vibration passed Solder Paste VIAs (SPV) techniques detecting an abnormal temperature, the
from the enclosure to disks. remarkably increase the service life and system adjusts the temperature using
• Multi-level fan vibration isolation: The multi- reliability of disks used in contaminated intelligent fan speed adjustment to ensure that
level (horizontal and vertical) vibration environments. the system works at a proper temperature.
attenuation mechanism reduces 40% vibration • Anti-corrosion techniques for controllers: • Over-temperature protection: If an abnormal
between fans, supports, and the enclosure. The anti-corrosion techniques along with the temperature persists for a long time, which
• Air baffles that reduce vibration and noise: temperature rise test and voltage distribution may adversely affect system operation and
The design of flow equalization for air baffles design provide comprehensive protection for data reliability, the system will take protection
enables the fans to dissipate heat evenly and controllers, prolonging controller service life measures such as service suspension and
reduces the vibration and noise caused by and improving controller reliability in system power-off to prevent the abnormal
turbulence. contaminated environments. temperature from causing a system failure.

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 8


Module-Level Reliability — Disk Reliability
Module System Solution
Level Level Level

ERT Joint test


About 1000 disks have
Tests/audits during ORT long-term test
Sampling test by Huawei production
been tested for three the supplier's for product samples
Huawei test
months by introducing production after delivery
multiple acceleration
factors. 5-level FA

System test Tiltwatch/Shockwatch


I. System/Disk logs
Locate simple
label problems quickly.
500+ test cases cover:
ERT long-term Circular improvements Disk deployment test
system functions and II. Protocol analysis
performance, reliability test (Quality/Supply/Cooperation) Locate interactive
compatibility with earlier problems accurately.
versions, disk system
reliability III. Electrical signal
Failure analysis analysis
Locate complex disk
Disk test Online quality data problems.
Joint test Component certification/ Start:
analysis/warning IV. Disk teardown for
100+ test cases cover: system
Preliminary test for R&D analysis
functions, performance, Qualification test samples/Analysis on Design review Locate physical
compatibility, reliability,
firmware, vibration,
product applications Disk selection RMA reverse damages of disks.
temperature cycle, disk V. Data restoration
head, disk, special
specifications verification
Introduction

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 9


Module-Level Reliability — HSSD Reliability
Module System Solution
Level Level Level

Data redundancy
• Die-level multi-copy & RAID: metadata (multiple copies) and user data (RAID).
• Data restoration: LDPC, read retry, and intra-disk XOR enable rebuilding of bad pages using redundant information.

Wear leveling
• Wear leveling periodically moves data blocks so that blocks with fewer consumed erase cycles can be used again, evening out wear across all blocks.
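The steering decision behind wear leveling can be illustrated with a tiny sketch (the erase counts and block layout are invented; the HSSD firmware is far more elaborate):

erase_count = {0: 950, 1: 120, 2: 400, 3: 80}   # erase cycles consumed per block (example)
free_blocks = {1, 3}

def pick_block_for_write() -> int:
    """Steer new writes to the free block with the fewest consumed erase cycles."""
    return min(free_blocks, key=lambda b: erase_count[b])

blk = pick_block_for_write()     # -> block 3
erase_count[blk] += 1            # its eventual erase is charged to the least-worn block
print(blk, erase_count)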

Bad block management and background inspection
1. Background inspection: combines read inspection and write inspection and proactively reports bad blocks detected during inspection.
2. Bad block isolation: detects, migrates, and isolates bad blocks (logical blocks are remapped from invalid physical blocks to reserved, unused blocks).

Advanced management (data management and disk management modules)
1. Online self-healing: restores a disk to its factory settings online.
2. Die failure: active reporting and capacity reduction.
3. Power failure protection: dirty data is flushed to persistent media using backup capacitors when power fails.

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 10


Module-Level Reliability — HSSD RAID
Module System Solution
Level Level Level

Conventional SSD: silent data corruption HSSD: bad block (die) self-recovery + single-disk
(D3) + single-disk fault → data loss fault → no data loss
RAID Disk 1 Disk 2 Disk 3 Disk 4 Disk 5 Disk 6 Disk 4 (intra-HSSD RAID)

RAID 4-x Die Die Die Die Die Die


CKG0 D0 D1 D2 D3 D4 P Parity
column
RAID 4-x Die Die Die Die Die Die

CKG1 D0 D1 D2 D3 P D4 Data
RAID 4-x Die Die Die Die Die Die column

... ... ... ... ... ...


CKG2 D0 D1 P D2 D3 D4
RAID 4-n Die Die Die Die Die Die

Package 1 Package 2 Package 3 Package 4 Package 5 Package 6

 Without RAID: Silent data corruption may occur on disks. (HDDs may have bad sectors and SSDs may have bad blocks, on which data is
unavailable.) If such bad sectors or blocks are seldom accessed, corruption of data in them cannot be detected or rectified in time. Once a
disk fails, data in these bad sectors or blocks cannot participate in reconstruction, resulting in loss of user data.
 With RAID: HSSDs periodically scan data blocks and restore detected bad blocks using intra-disk RAID. In addition, data in bad blocks can
be restored in real time using inter-disk RAID when these blocks are accessed by a host or participate in data reconstruction. In this way, data
will not be lost.
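The intra-disk recovery described above can be illustrated with a simple XOR-parity model. The sketch below is an assumption-level illustration (the actual HSSD die-level RAID layout and parity math are not specified in the slide beyond "RAID 4-x"): one parity chunk protects the chunks at the same offset on the other dies, so a bad block or failed die is rebuilt by XORing the survivors.

```python
# Minimal sketch of RAID-4-style parity across dies (illustrative only).

from functools import reduce

def xor_bytes(chunks):
    # XOR a list of equal-length byte strings together.
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

def make_parity(data_chunks):
    return xor_bytes(data_chunks)

def rebuild(surviving_chunks, parity):
    # The missing chunk is the XOR of the parity and all surviving data chunks.
    return xor_bytes(surviving_chunks + [parity])

# Example: 5 data dies + 1 parity die; die 2 fails and is rebuilt.
dies = [bytes([i]) * 8 for i in range(5)]
parity = make_parity(dies)
recovered = rebuild(dies[:2] + dies[3:], parity)
assert recovered == dies[2]
```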

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 11


Contents
1 Storage Reliability Metrics

2 Module-Level Reliability

3 System-Level Reliability

4 Solution-Level Reliability

5 O&M Reliability

6 Reliability Tests and Certifications

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 12


System-Level Reliability

High service availability
• Controller failover in seconds
• Continuous mirroring for high-end models
• HyperMetro-Inner for high-end models
• Overload control

Solid data reliability
• Multiple cache copies
• RAID 2.0+
• Fast reconstruction
• Reconstruction offloading
• Dynamic reconstruction
• E2E data protection

High disk fault tolerance
• Wear leveling/anti-wear leveling
• Bad sector/block scanning and repair
• Online diagnosis
• Quick response to slow I/Os
• Isolation of slow disks

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 13


High Service Availability — E2E Redundancy Design

From the host, through shared front-end interface modules, controllers, and inter-enclosure exchange, down to the back end and disk enclosures, every component on the I/O path is redundant:
1. Controller switchover is transparent to services: shared front-end interconnect I/O modules (FIMs), protocol offloading, and controller failover within seconds.
2. Services are not interrupted if multiple controllers are faulty (HyperMetro-Inner): three cache copies, continuous mirroring, cross-engine mirroring, and a fully meshed back end.
3. Services are not affected by a software fault: process availability detection, restart of faulty processes within seconds, and intermittent isolation of frequently abnormal background tasks.
4. Services are not interrupted if multiple disks are faulty: EC-2/EC-3 protection for user data.
5. Controllers do not reset if the interface modules used for inter-enclosure exchange are faulty: high-end storage uses multiple interface modules for redundancy, while mid-range and entry-level storage with a single interface module uses TCP forwarding.

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 14


High Service Availability — Controller Failover Within Seconds

1. The host delivers I/Os to controller A: when all controllers are normal, I/Os are delivered to controller A through front-end interface module (FIM) 1.
2. Controller A becomes faulty: FIM 1 and controller B detect that controller A is unavailable by means of interrupts.
3. Service switchover: services are switched within 1 second to controller B, which holds the data copies of controller A, by switching the vNode. The FIMs are then instructed to refresh the distribution view.
4. I/O path switchover: FIM 1 returns BUSY for the I/Os that were already delivered to controller A. Retried and new I/Os from the host are delivered to controller B based on the new view.

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 15


High Service Availability — Continuous Mirroring

Cache mirroring relationships are re-established as controllers fail, across three states: normal operation, failure of one controller (controller A), and failure of one more controller (controller D).

 Continuous mirroring (ensuring service continuity even when seven out of eight controllers are faulty): If controller A is faulty, controller B selects controller C or D as its cache mirror. If controller D then also fails, cache mirroring is implemented between controller B and controller C to maintain dual-copy redundancy.
 Service continuity: If a controller fails, its mirror controller establishes a mirror relationship with another functional controller within 5 minutes. This design increases service availability by at least one nine and ensures service continuity when multiple controllers fail successively.

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 16


High Service Availability — HyperMetro-Inner for High-End Storage

Tolerating simultaneous failure of two controllers:
• The global cache keeps three cache copies across controller enclosures.
• If two controllers fail simultaneously, at least one cache copy remains available.
• With the three-copy mechanism, a single controller enclosure can tolerate simultaneous failure of two of its controllers.

Tolerating failure of an entire controller enclosure:
• The global cache keeps three cache copies across controller enclosures.
• A smart disk enclosure connects to all 8 controllers (in 2 controller enclosures) through BIMs.
• If a controller enclosure fails, at least one cache copy remains available.

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 17


High Service Availability — SmartQoS

Upper Limit Control
• Priority setting: The storage system converts the traffic-control objective into a number of tokens. You can set upper-limit objectives for low-priority LUNs or snapshots to guarantee sufficient resources for high-priority LUNs or snapshots.
• Token application: The storage system processes dequeued LUN or snapshot I/Os using tokens. I/Os can be dequeued and processed only when sufficient tokens are obtained.

Burst Quota
• Token accumulation: If the performance of a LUN, snapshot, LUN group, or host stays below the upper threshold within a second, one second of burst duration is accumulated. When service pressure suddenly increases, performance may exceed the upper limit up to the burst traffic; the accumulated tokens are consumed by the current objects and last for the configured duration. In this way, the system responds to burst traffic in time.

Lower Limit Guarantee
• Minimum traffic: Each LUN is configured with a minimum traffic objective (IOPS/bandwidth) by default, and this minimum must be guaranteed when the system is overloaded.
• Traffic suppression for high-load LUNs: When the system is overloaded and the traffic of some LUNs does not reach the lower limit, the system rates the load of all LUNs. Medium- and low-load LUNs are given looser traffic conditions based on their load status, while high-load LUNs are throttled until enough resources are released for all LUNs to reach their lower limits. (LUNs whose traffic does not reach the lower limit are not suppressed; LUNs whose traffic reaches the lower limit get burst prevention; LUNs whose traffic far exceeds the lower limit are suppressed.)
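As a rough illustration of the token and burst-credit mechanics described above, here is a hedged sketch (the product's actual token accounting is not published here; class name, fields, and the one-token-per-I/O model are assumptions):

```python
# Illustrative token bucket with burst credit, per the SmartQoS description:
# seconds spent below the limit earn up to `max_burst_seconds` of credit,
# which later lets the object temporarily run at `burst_iops`.

class QosBucket:
    def __init__(self, limit_iops, burst_iops, max_burst_seconds):
        self.limit = limit_iops
        self.burst = burst_iops
        self.max_credit = max_burst_seconds   # accumulated burst duration (s)
        self.credit = 0.0
        self.tokens = limit_iops              # refilled once per second

    def refill(self, used_last_second):
        if used_last_second < self.limit:
            self.credit = min(self.max_credit, self.credit + 1.0)  # idle second earns credit
        elif used_last_second > self.limit:
            self.credit = max(0.0, self.credit - 1.0)              # bursting spends credit
        self.tokens = self.burst if self.credit >= 1.0 else self.limit

    def try_dequeue(self, n_ios=1):
        # An I/O is dequeued and processed only when enough tokens are available.
        if self.tokens >= n_ios:
            self.tokens -= n_ios
            return True
        return False          # not enough tokens: the I/O stays queued

bucket = QosBucket(limit_iops=1000, burst_iops=3000, max_burst_seconds=60)
bucket.refill(used_last_second=200)   # under the limit -> accrue burst credit
```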

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 18


Solid Data Reliability

E2E data protection across cache, disks, and dies:
1. Cache data redundancy: Two or three copies of cache data ensure no data loss when multiple controllers or an entire controller enclosure is faulty.
2. Disk data redundancy: RAID 2.0+ ensures that user data on disks is not lost when multiple disks fail consecutively or simultaneously. Data reconstruction is offloaded to smart disk enclosures, further ensuring data reliability.
3. Intra-disk data redundancy: RAID 4 provides die-level redundancy within a disk, preventing user data loss in the case of bad blocks or die failures.
4. Redundancy maintained even when RAID disks are insufficient: Dynamic reconstruction, which involves fewer data disks (for example, 22+2 becomes 21+2 after a disk fault), maintains redundancy when the number of member disks no longer meets RAID requirements.
5. E2E data consistency: E2E PI and parent-child hierarchy verification ensure that data on I/O paths is not damaged.

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 19


Solid Data Reliability — Multiple Cache Copies

With a single controller enclosure, the third cache copy is saved on another controller; with two or more controller enclosures (for example, enclosures 0 and 1, each with controllers A-D), the third copy is saved in another controller enclosure.

 Data will not be lost if two controllers are faulty: Three copies of cache data are supported. For host data with
the same LBA, the system creates a pair of cache data copies on two controllers and the third copy on another
controller.
 Data will not be lost if a controller enclosure is faulty: When the system has two or more controller enclosures,
three copies of cache data are saved on controllers in different controller enclosures. This ensures that cache data
will not be lost in the event that a controller enclosure (containing four controllers) is faulty.
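A simple placement sketch helps make the rule above concrete. This is an assumption-level illustration only (the real copy-placement algorithm is not described in the slide beyond "pair on two controllers, third copy on another controller / another enclosure"); controller IDs and the selection order are made up for the example.

```python
# Illustrative three-copy placement: copies 1 and 2 on different controllers,
# copy 3 in a different controller enclosure when the system has one.

def place_copies(owner, controllers):
    """controllers: list of (controller_id, enclosure_id); `owner` holds copy 1."""
    own_enc = dict(controllers)[owner]
    # Copy 2: another controller, here preferring the same enclosure (mirror pair).
    second = next(c for c, e in controllers if c != owner and e == own_enc)
    # Copy 3: a controller in a different enclosure if one exists, otherwise
    # any remaining controller in the same enclosure.
    third = next((c for c, e in controllers if e != own_enc),
                 next(c for c, e in controllers if c not in (owner, second)))
    return [owner, second, third]

ctrls = [("A0", 0), ("B0", 0), ("C0", 0), ("D0", 0),
         ("A1", 1), ("B1", 1), ("C1", 1), ("D1", 1)]
print(place_copies("A0", ctrls))   # e.g. ['A0', 'B0', 'A1']
```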

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 20


Solid Data Reliability — RAID 2.0+

Conventional RAID builds LUNs on fixed RAID groups with dedicated hot spare disks; RAID 2.0+ builds LUNs on block virtualization, where data blocks from all disks form the RAID stripes.

Conventional RAID:
• Resource management is based on disks; I/Os of a LUN are processed by the limited set of disks in one RAID group.
• Slow reconstruction: if a single disk is faulty, only the limited number of disks in the RAID group participate in reconstruction.
• A hot spare disk must be specified; once the hot spare disk itself fails, it must be replaced in time.

Block virtualization-based RAID (RAID 2.0+):
• Resource management is based on data blocks; I/Os to each LUN are evenly distributed across all disks, balancing performance.
• Fast reconstruction: if a single disk is faulty, all disks participate in reconstruction.
• Reconstruction can be performed as long as there is free space, independent of specific hot spare disks.

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 21


Solid Data Reliability — Fast Reconstruction

Reconstruction using conventional RAID (e.g., RAID 5 4+1 with a hot spare disk): during reconstruction, data is read from the remaining functional disks and rebuilt, and the reconstructed data is written to a hot spare disk or a new disk. The write performance of that single disk restricts reconstruction, so reconstruction takes a long time.

Reconstruction using RAID 2.0+: CKGs are striped across dozens of member disks. When a disk fails, the other disks all participate in reconstruction reads and writes, greatly shortening the reconstruction time. As more disks share the reconstruction load, the load on each disk significantly decreases.
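A back-of-envelope model shows why spreading rebuild writes matters. The numbers below are assumptions chosen only for illustration (not measured OceanStor figures): conventional RAID is bottlenecked by one hot-spare disk's write bandwidth, while RAID 2.0+ spreads rebuild writes over the free space of many disks.

```python
# Assumed-numbers illustration of reconstruction time vs. writer disks.

def rebuild_hours(failed_capacity_gb, per_disk_write_mbps, writer_disks):
    total_mb = failed_capacity_gb * 1024
    return total_mb / (per_disk_write_mbps * writer_disks) / 3600

capacity = 8000   # GB of data to rebuild (assumed)
speed = 100       # MB/s sustained rebuild write rate per disk (assumed)

print(f"conventional (1 spare disk): {rebuild_hours(capacity, speed, 1):.1f} h")
print(f"RAID 2.0+ (50 writer disks): {rebuild_hours(capacity, speed, 50):.1f} h")
```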

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 22


Solid Data Reliability — Reconstruction Offloading

Controller-based reconstruction:
1. Reconstruction occupies controller computing (CPU) resources: when one or more disks fail, all recovery data is computed on the controller, which can overload the controller CPU and degrade host I/O processing.
2. Reconstruction occupies massive data bandwidth: all data on the RAID group's disks is read to the controller for computing (a single disk read request fans out to dozens of blocks per stripe), occupying back-end write bandwidth and affecting host I/O write bandwidth.

Reconstruction offloading (about 2x better reconstruction performance):
1. RAID member-disk computation is offloaded to smart disk enclosures: the enclosures have idle CPU resources, so each enclosure reads its local blocks and computes partial parity results (e.g., P' and Q' in disk enclosure 1, P" and Q" in disk enclosure 2); the controller only combines these partial results to calculate the lost data and writes it to hot spare space.
2. Reconstruction occupies little data bandwidth: data for the disk being recovered is computed inside the smart disk enclosure and does not need to be transmitted to the controller, reducing back-end bandwidth consumption and the impact of reconstruction on system performance.

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 23


Solid Data Reliability — Dynamic Reconstruction

Example: a RAID 4+2 group (D0-D3 + P + Q across disks 1-6) loses a member disk. If there are not enough disks for a spare CK, newly written and rebuilt CKGs use RAID 3+2 (D0'-D2' + P' + Q') instead.

For a RAID group with M+N members (M data columns and N parity columns):
• Common reconstruction: when a disk is faulty, the system uses an idle CK to replace the faulty one and restores data to that CK. If the disk domain has fewer than M+N member disks, two CKs end up on the same disk, decreasing the RAID redundancy level.
• Dynamic reconstruction: if the disk domain has fewer than M+N member disks, the system reduces the number of data columns (M) and retains the number of parity columns (N) during reconstruction. This preserves the RAID protection level and system reliability.
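The policy above can be expressed in a few lines. The sketch below is illustrative only (the product's actual selection logic, minimum data-column count, and CKG handling are not specified in the slide; `min_m` is an assumption):

```python
# Minimal sketch of dynamic reconstruction: shrink the number of data columns
# M but keep the N parity columns when fewer than M+N healthy disks remain.

def choose_layout(healthy_disks, m, n, min_m=2):
    """Return (data_columns, parity_columns) for newly written/rebuilt CKGs."""
    if healthy_disks >= m + n:
        return m, n                 # normal layout, e.g. 4+2
    new_m = healthy_disks - n       # keep N, give up data columns
    if new_m < min_m:
        raise RuntimeError("not enough healthy disks to keep N parity columns")
    return new_m, n                 # e.g. 4+2 -> 3+2 when one disk is lost

print(choose_layout(6, 4, 2))   # (4, 2)
print(choose_layout(5, 4, 2))   # (3, 2): dynamic reconstruction
```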

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 24


Solid Data Reliability — E2E Data Protection

An 8 KB data block is stored as 16 sectors of 512 B, each followed by an 8 B PI field; metadata nodes form a parent-child tree above the data. Verification happens at five protection points:
• Protection point 1: data PI verification (PI is inserted when the host writes data and verified when data is read).
• Protection point 2: checksum verification between data and metadata (a per-8 KB checksum derived from the sector CRCs is stored with the metadata).
• Protection point 3: metadata CRC.
• Protection point 4: metadata DIF verification.
• Protection point 5: parent-child hierarchy verification of metadata.

Data protection at hardware boundaries: Data verification is performed at multiple key nodes on the I/O path within a storage system, including the front-end chips, controller software front end, controller software back end, and back-end chips.
Multi-level software-based data protection: For every 512 bytes of data, in addition to the 8-byte PI (two bytes of which are CRC bytes), the system extracts the CRC bytes of the 16 PI sectors to form a checksum and stores it in the metadata node. If skew occurs in one or more (512+8)-byte pieces of data, the checksum also changes and becomes inconsistent with the one saved in the metadata node. When the system reads the data and detects the inconsistency, it uses the RAID redundancy on other disks to recover the damaged data, preventing data loss.
Metadata protection: Metadata is organized in a tree structure, and a parent metadata node stores the CRC values of its child nodes, similar to the relationship between data and metadata. If metadata is damaged, it can be verified and restored using the parent and child nodes.
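The per-sector PI plus per-block checksum layering can be sketched as follows. This is a hedged illustration: the exact PI field layout and CRC polynomial used by the product are not given in the slide, so `zlib.crc32` and the 8-byte packing are stand-ins.

```python
# Illustrative layering: an 8 B PI (with a 2 B CRC) per 512 B sector, and a
# checksum over the 16 sector CRCs stored with the block's metadata.

import struct
import zlib

SECTOR = 512

def sector_pi(sector_bytes):
    crc = zlib.crc32(sector_bytes) & 0xFFFF      # 2-byte CRC, per the slide
    return struct.pack(">HxxI", crc, 0)          # padded to an 8-byte PI field

def block_checksum(block_8k):
    sectors = [block_8k[i:i + SECTOR] for i in range(0, len(block_8k), SECTOR)]
    pis = [sector_pi(s) for s in sectors]
    # Checksum over the per-sector PI fields, to be stored in the metadata node.
    return zlib.crc32(b"".join(pis)), pis

def verify(block_8k, stored_checksum):
    checksum, _ = block_checksum(block_8k)
    return checksum == stored_checksum           # mismatch => repair via RAID

data = bytes(8192)
meta_checksum, _ = block_checksum(data)
assert verify(data, meta_checksum)
assert not verify(b"\x01" + data[1:], meta_checksum)  # corruption is detected
```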

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 25


Solid Data Reliability — E2E Data Protection (Example)

Host: writes and reads data; PI is inserted on write. On read, the system calculates the data CRC and compares it with the existing PI; if they are inconsistent, the data is damaged, so the system restores it using RAID and then returns a success message to the host.

Service layer: records the PI in metadata. On read, it compares the PI (CRC) carried with the data against the value recorded in the corresponding metadata; if they are inconsistent, read skew has occurred, so the system restores the data using RAID and returns a success message to the upper layer.

Data layer: verifies PI and marks the data as incorrect if verification fails. On read, it reads the data and verifies the PI; if verification fails, the data on the disk is considered damaged, the system restores it using RAID, and a success message is returned to the upper layer.

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 26


High Disk Fault Tolerance

Disk fault tolerance spans the whole disk lifecycle:
• Access configuration: access authentication, parameter configuration, resource pre-allocation.
• Access optimization: access model optimization, service flow control and balancing, separation of user data and metadata, wear leveling/anti-wear leveling.
• Operation management: operation records, environment monitoring, hibernation, firmware management.
• Prediction: health evaluation, sub-health status prediction.
• Diagnosis: bad sector/block scanning and repair, online diagnosis.
• Fault tolerance: quick response to slow I/Os, isolation of slow disks, refined error code handling, bit error and intermittent disconnection processing.
• Recovery: data restoration and migration.
• Isolation: fault isolation, deep inspection, NPF problem processing.
• Offline: state maintaining, secure offlining.
HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 27


Solid Disk Reliability — Wear Leveling and Anti-wear Leveling

Global wear leveling and anti-wear leveling work together across the SSDs in a disk domain:
• Wear leveling: SSDs can withstand only a limited number of write/erase operations. The system evenly distributes workloads across SSDs, preventing some disks from failing early due to continuous frequent access.
• Anti-wear leveling: To prevent multiple SSDs from failing at the same time, the system starts anti-(global) wear leveling when it detects that SSD wear has reached a threshold. Data is then deliberately distributed unevenly so that the wear degrees of the SSDs differ by at least 2%.
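A hedged sketch of the anti-wear-leveling rule above (this is an assumption about one possible policy, not the product algorithm; the threshold, gap, and weighting values are invented for illustration):

```python
# Illustrative anti-wear leveling: above a wear threshold, bias writes so that
# SSDs whose wear levels are too close drift at least `min_gap` percent apart.

def write_bias(wear_pct, threshold=70.0, min_gap=2.0):
    """wear_pct: {ssd_id: wear %}. Returns per-SSD write weights (1.0 = normal)."""
    weights = {ssd: 1.0 for ssd in wear_pct}
    if max(wear_pct.values()) < threshold:
        return weights                           # below threshold: plain wear leveling
    ranked = sorted(wear_pct, key=wear_pct.get)  # least worn -> most worn
    for a, b in zip(ranked, ranked[1:]):
        if wear_pct[b] - wear_pct[a] < min_gap:
            weights[b] += 0.5                    # push the more-worn SSD further ahead
    return weights

wear = {f"SSD{i}": 70.0 + i for i in range(10)}
wear["SSD1"] = 70.2                              # too close to SSD0
print(write_bias(wear))                          # SSD1 gets an extra write share
```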

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 28


Solid Disk Reliability — Bad Sector/Block Scanning and Repair

Technical principles:
• Storage systems periodically scan disks for potential faults. The scan algorithm uses cross scanning with dynamic rate adjustment.
• When a bad sector is detected, the storage system uses RAID group redundancy to recover the data on the bad sector (for example, 5' = 3 XOR 4 XOR P1 in a RAID/CKG stripe) and writes the recovered data back to disk; disk remapping isolates the faulty block and maps it to the internal reserved space.

Technical highlights:
• Disk vulnerabilities are detected and removed as early as possible, minimizing the system risk caused by disk failures.

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 29


Solid Disk Reliability — Online Diagnosis

• Monitor: collects statistics on minor disk errors and detects head or disk abnormalities in advance.
• Isolate: isolates a disk for diagnosis and recovery when it malfunctions for the first time; isolates and stops using a disk that malfunctions several times.
• Diagnose: performs a series of operations, such as power-on, power-off, SMART checks, I/O retries, verification, and disk self-checks, to analyze the causes and severity of disk failures, and repairs bad sectors to recover disks.
• Restore: if diagnosis succeeds, the disk is reconnected to the system and continues to provide services; if diagnosis fails, the disk is no longer connected to the system and RAID reconstruction starts.

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 30


Solid Disk Reliability — Quick Response to Slow I/Os

Technical principles:
• When a read error occurs due to physical bad sectors, head problems, vibration, or electron escape, the disk handles the error by raising voltage, re-reading data, or remapping; as a result, the I/O response time may be prolonged.
• The system monitors the response time of I/Os delivered to disks. If the response time of I/Os to a disk exceeds the upper threshold, the system stops accessing that disk and quickly answers host requests using data restored from the other RAID member disks (for example, a = b XOR c XOR d). The disk is accessed again if no fault is found or after it recovers from the fault.

Technical highlights:
• The adverse impact of slow disks, or of a briefly slow disk response, is eliminated before disks need to be isolated.
• I/O requests are answered in a timely manner, preventing services from being affected by long retry and recovery procedures.
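A minimal sketch of the latency-budget fallback described above (illustrative only; the timeout value, threading model, and XOR-only redundancy are assumptions made for the example, not the product implementation):

```python
# If the target disk does not answer within the budget, stop waiting and
# rebuild the requested strip from the other RAID members (a = b ^ c ^ d).

import concurrent.futures as cf

def read_with_fallback(read_disk, target, peers, timeout_ms=30):
    """read_disk(disk) -> bytes; `peers` are the other members of the stripe."""
    pool = cf.ThreadPoolExecutor(max_workers=1)
    fut = pool.submit(read_disk, target)
    try:
        data = fut.result(timeout=timeout_ms / 1000.0)
        pool.shutdown(wait=False)
        return data
    except cf.TimeoutError:
        pool.shutdown(wait=False)          # do not wait for the slow disk
        chunks = [read_disk(p) for p in peers]
        out = bytearray(len(chunks[0]))
        for ch in chunks:
            for i, byte in enumerate(ch):
                out[i] ^= byte
        return bytes(out)                  # data reconstructed from redundancy
```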

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 31


Solid Disk Reliability — Slow Disk Detection and Isolation

Technical principles:
• Disks are classified into domains based on disk characteristics and array attributes (such as rotational speed, interface type, and medium type). The average response time of all disks in a domain is used as a baseline to find disks that are relatively slower.
• If a disk is identified as slower than the other disks over multiple periods, the system determines that it is a slow disk and temporarily isolates it for diagnosis. If the disk cannot be recovered, it is permanently isolated.

Technical highlights:
• When all disks in the system become slow, no single disk's response time is much greater than the others', so no disk is isolated, reducing false isolations.
• Slow disks are identified and isolated only after diagnosis and repair.
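The baseline comparison above can be sketched as a simple detector. This is an assumption-level illustration (the ratio, period count, and strike mechanism are invented parameters, not the product's values):

```python
# A disk is flagged only if it stays well above its domain's average latency
# for several consecutive periods; if every disk slows down, none is flagged.

from collections import defaultdict

class SlowDiskDetector:
    def __init__(self, ratio=3.0, periods=5):
        self.ratio = ratio               # how much slower than the baseline
        self.periods = periods           # consecutive periods before isolation
        self.strikes = defaultdict(int)

    def observe(self, latencies_ms):
        """latencies_ms: {disk_id: avg response time this period}.
        Returns disks to send for temporary isolation and diagnosis."""
        baseline = sum(latencies_ms.values()) / len(latencies_ms)
        suspects = []
        for disk, lat in latencies_ms.items():
            if lat > self.ratio * baseline:
                self.strikes[disk] += 1
                if self.strikes[disk] >= self.periods:
                    suspects.append(disk)
            else:
                self.strikes[disk] = 0   # must be slow in consecutive periods
        return suspects
```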

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 32


Contents
1 Storage Reliability Metrics

2 Module-Level Reliability

3 System-Level Reliability

4 Solution-Level Reliability

5 O&M Reliability

6 Reliability Tests and Certifications

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 33


Solution-Level Reliability

Intra-system reliability solution:
1. HyperSnap: snapshot activation based on multi-time-segment cache technology does not block host I/Os; high-density snapshots are supported.
2. HyperClone: a complete physical copy with isolation of the source and target LUNs; incremental synchronization between the source and target LUNs.

Inter-system reliability solution (production DC A and DC B run an intra-city active-active solution over FC/IP SAN with real-time data synchronization, asynchronous replication to remote DR center C, and an optional quorum device):
3. HyperMetro: active-active architecture in which active-active LUNs are readable and writable in both DCs and data is synchronized in real time; cross-site bad block repair improves system reliability; elastically scalable to a 3DC solution based on remote replication.
4. HyperReplication: synchronous and asynchronous data replication across storage systems provides intra-city and remote data protection.

Cloud backup solution:
5. CloudBackup: public cloud object storage (e.g., HEC/AWS) is used as the backup target to prevent data loss caused by faults on storage devices in the enterprise DC.

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 34


Solution-Level Reliability — HyperSnap

Overview
• HyperSnap quickly captures online data and generates snapshots of source data at specified points in time (for example, at 08:00, 12:00, 16:00, and 20:00) without interrupting system services, preventing data loss caused by viruses or misoperations. The snapshots can be used for backup and testing.

Highlights
• The innovative multi-time-segment cache technology can continuously activate snapshots at intervals of a few seconds. Activating snapshots does not block host I/Os, and host services remain responsive.
• Based on the RAID 2.0+ virtualization architecture, the system flexibly allocates storage space for snapshots, making dedicated resource pools unnecessary.

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 35


Solution-Level Reliability — HyperClone

Overview
 HyperClone generates a complete physical
copy (target LUN) of the production LUN
(source LUN) at a point in time. The copy can
be used for backup, testing, and data analysis.

Highlights
 A complete, consistent physical copy
 Isolation of source and target LUNs,
eliminating mutual impact on performance
 Consistency groups, enabling the consistent
splitting of multiple LUNs
 Incremental synchronization
 Reverse incremental synchronization (from the
target LUN to the source LUN)

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 36


Solution-Level Reliability — HyperMetro

Architecture: an Oracle RAC, VMware vSphere, or FusionSphere cluster spans DC A and DC B over FC/IP SAN, with an optional quorum device reachable over IP networks.

Principles:
• One Huawei OceanStor storage system is deployed in each of DC A and DC B in active-active mode. The two storage systems concurrently provide read and write services for the service hosts in both DCs. No data is lost in the event of the breakdown of either DC.

Design:
• Active-active architecture: active-active LUNs are readable and writable in both DCs, and data is synchronized in real time.
• High reliability: the dual-arbitration mechanism and cross-site bad sector repair boost system reliability.
• Flexible scalability: SmartVirtualization, HyperSnap, and HyperReplication are supported, and HyperMetro can be expanded to the Disaster Recovery Data Center Solution (geo-redundant mode).

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 37


Solution-Level Reliability — HyperReplication

Overview
• HyperReplication supports both synchronous and asynchronous remote replication between a production center and a DR center. It is used in disaster recovery solutions to provide intra-city and remote data protection, preventing data loss caused by disasters and improving business continuity.

Highlights
• Remote replication interworking among entry-level, mid-range, and high-end storage systems
• RPO within seconds for asynchronous replication and zero RPO for synchronous replication
• Consistency groups
• Incremental synchronization
• Fibre Channel and IP links
• Networking: bi-directional replication, 1:N, and N:1
• 3DC: cascading replication and parallel replication

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 38


Solution-Level Reliability — 3DC

Cascaded architecture: production center A replicates synchronously or asynchronously to intra-city DR center B, which in turn replicates asynchronously to remote DR center C.
1. The intra-city DR center undertakes the remote replication tasks, so the impact on services in the production center is minor.
2. If the storage system in the production center malfunctions, the intra-city DR center takes over services and keeps a data replication relationship with the remote DR center.

Parallel architecture: production center A replicates synchronously or asynchronously to intra-city DR center B and, in parallel, asynchronously to remote DR center C. If the storage system in the production center malfunctions, the intra-city or remote DR center can quickly take over services.

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 39


Contents
1 Storage Reliability Metrics

2 Module-Level Reliability

3 System-Level Reliability

4 Solution-Level Reliability

5 O&M Reliability

6 Reliability Tests and Certifications

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 40


O&M Reliability — Fast Upgrade

Upgrade transparent to hosts:
• Hosts are unaware of the upgrade: each controller has an I/O holding module, which holds in-flight I/Os while a component is upgraded and restarted. After the upgrade, the components continue to process the held I/Os, so hosts do not detect any connection interruption or I/O exception.
• Component-based upgrade: the system upgrade is divided into two phases. Software components (processes) with redundant units are upgraded first; after the software packages are uploaded and those processes are restarted, the second phase is triggered.
• Zero performance loss: each software component restarts within 1 second. The front-end interface module returns BUSY for I/Os that fail during the upgrade, the host re-delivers them, and performance is restored to 100% within 2 seconds.
• Short upgrade duration: no host compatibility issues are involved, so host information does not need to be collected for evaluation. The entire storage system can be upgraded within 10 minutes because controllers do not need to be restarted.

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 41


O&M Reliability — Intelligent Prediction

Elimination of potential risks, improving reliability:
• Predict disk risks 14 days in advance.
• Use the XGBoost algorithm to identify 80% of disk risks with only a 0.1% misreporting rate.

Massive data analysis, building mature risky-disk prediction models:
• Analyze 500,000+ disks and 20+ billion feature records.
• Verify the models against enterprise data centers for over 600 days.

Capacity monitoring, identifying overloaded resources in advance:
• Predict the capacity trend for the next 12 months and determine capacity requirements.
• Predict capacity consumption and identify overloaded resources.

Intelligent capacity planning and on-demand procurement, reducing TCO:
• Optimize idle resources to improve resource utilization.
• Evaluate capacity requirements and provide detailed expansion solutions.

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 42


Contents
1 Storage Reliability Metrics

2 Module-Level Reliability

3 System-Level Reliability

4 Solution-Level Reliability

5 O&M Reliability

6 Reliability Tests and Certifications

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 43


Reliability Tests and Certifications
Hardware Tests and Certifications
• EMC test: ensures that the product meets the corresponding EMC standards, the real-world electromagnetic environment, and device/system compatibility. Result: meets the mandatory admission certification requirements of each region/country and organization (China: CCC; European Commission: CE; Japan: VCCI-A; Russia: CU; and others).
• Safety test: ensures personal safety when using the products, reduces injury caused by electric shock, fire, heat, mechanical damage, radiation, chemical damage, and energy, and meets the admission requirements of each country.
• Environment (climatic) test: checks that the products meet requirements and exposes defects in design, process, and materials.
• Environmental (mechanical) test: improves the products' adaptability to mechanical stress during storage and transportation, ensuring qualified appearance, structure, and performance and that the equipment withstands the adverse impact of external mechanical stress.
• HALT test: finds the weak points of the products and improves product reliability.
• The products have also passed optional certifications, such as China's earthquake resistance certification and Environmental Labeling certification.

Ecosystem Compatibility Certifications
• After 10+ years of technical accumulation and continuous investment, Huawei has established the largest storage interoperability lab in Asia. The lab provides 10,000+ pages of interoperability lists and 1 million+ verified service scenarios, covering 4,000+ software and hardware versions of mainstream applications, operating systems, virtualization products, servers, and switches. Huawei has cooperated with mainstream vendors and earned 1,500+ certificates, and Huawei products are compatible with mainstream vendors' new products as soon as they are launched.
• Huawei is the storage vendor with the most upstream and downstream partner resources and is a top strategic partner of Seagate, Western Digital, Intel, SAP, and Microsoft.

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 44


Take Away
Key Reliability Technologies of OceanStor Dorado V6
• High Service Availability (tolerates failure of 7 out of 8 controllers)
Controller Failover within Seconds, Continuous Mirroring, HyperMetro-Inner

• Solid Data Reliability


Multiple Cache Copies, RAID 2.0+, E2E Data Protection
Fast Reconstruction, Reconstruction Offloading, Dynamic Reconstruction

• High Disk Fault Tolerance


Wear Leveling/Anti–wear Leveling, Bad Sector/Block Scanning and Repair, Quick Response to Slow I/Os, Intra-disk RAID

• Disaster Recovery Solution


HyperSnap, HyperClone, HyperReplication, 3DC (Cascaded and Parallel Architectures)

• O&M Reliability
Fast Upgrade

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 45


Thank You
www.huawei.com

Copyright © 2020 Huawei Technologies Co., Ltd. All rights reserved.
All logos and images displayed in this document are the sole property of their respective copyright holders. No endorsement, partnership, or affiliation is suggested or
implied. The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results,
future product portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed
or implied in the predictive statements. Therefore, such information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei
may change the information at any time without notice.

HUAWEI TECHNOLOGIES CO., LTD. Huawei Confidential 46


Huawei OceanStor 100D Distributed Storage
Architecture and Key Technology

Department:
Prepared By:
Date: Feb., 2020

Security Level: Internal only


OceanStor 100D (Block + File + Object + HDFS): Overview · Ultimate Performance · Ultimate Usability · Ultimate Efficiency · Ultimate Reliability · Ultimate Ecosystem · Scenario and Ecosystem · Hardware

2 Huawei Confidential
Typical Application Scenarios of Distributed Storage

Virtual storage pool (cloud storage resource pools for public/private cloud services such as BSS, OSS, and MSS; both compute-storage convergence and storage-compute separation deployments):
• Storage space pooling and on-demand expansion, reducing initial investment
• Ultra-large-scale VM deployment
• Zero data migration and simple O&M
• TCO reduced by 40%+ compared with the traditional solution

HDFS (from traditional compute-storage convergence to storage-compute separation over Ethernet):
• Storage-compute separation, enabling elastic scaling of compute and storage
• EC technology, achieving higher utilization than the traditional Hadoop three-copy scheme
• Server-storage convergence, reducing TCO by 40%+

HPC/Backup/Archival (one storage system for HPC data lifecycle management: hot data on SSD production storage, warm data tiered to HDD backup storage, cold data tiered to archival storage or the cloud; tiering from file to object service; integration with backup/archive software and public cloud):
• Meets HPC requirements for high bandwidth and intensive IOPS; cold and hot data can be stored together
• Backup and archival with on-demand purchase and elastic expansion

3 Huawei Confidential
Core Requirements for Storage Systems

Distributed storage hot issue statistics (share of respondents citing each issue, ranging from 36.4% to 60.7%): high scalability, I/O performance, usability, cost, hotspot performance, data security, reliability, and features.

Users' core requirements: scalability, performance, usability, reliability, and cost.

Source: 130 questionnaires and 50 bidding documents from carriers, governments, finance sectors, and large enterprises.

4 Huawei Confidential
Current Distributed Storage Architecture Constraints

Mainstream open-source software in the industry:
• A user LUN or file is mapped to multiple continuous objects on local OSDs, so three mapping steps (LUN/file -> object -> PG -> OSD via the CRUSH map) are needed to locate data.
• Protection groups (PGs) have a great impact on system performance and layout balancing. They must be dynamically split and adjusted as the storage scale changes, and this adjustment affects system stability.
• The minimum management granularity is the object. The object size is adjustable, which affects metadata management, space consumption, performance, and even read/write stability.

Mainstream commercial software in the industry:
• Multiple disks form a disk group (DG), and each DG is configured with one SSD cache. Heavy I/O loads trigger cache writeback, greatly deteriorating performance.

Comparison:
• Performance: open source — the CRUSH algorithm is simple but does not support RDMA, and performance for small I/Os and EC is poor; commercial — DG-level SSD caching deteriorates sharply under heavy writeback load.
• Reliability: open source — CRUSH-related design constraints cause uneven data distribution, uneven disk space usage, and insufficient subhealth handling; commercial — data reconstruction in fault scenarios has a great impact on performance.
• Scalability: open source — restricted by the CRUSH algorithm, adding nodes is costly and large-scale expansion is difficult; commercial — poor scalability, with only up to 64 nodes supported.
• Usability: open source — lacks cluster management and maintenance interfaces, so usability is poor; commercial — inherits the vCenter management system, so usability is high.
• Cost: open source — a community product with low cost, but EC is not used commercially and deduplication and compression are not supported; commercial — high cost; all-flash configurations support intra-DG (not global) deduplication but no compression, and EC supports only 3+1 or 4+2.

5 Huawei Confidential
Overall Architecture of Distributed Storage (Four-in-One)

• Protocol layer: VBS (SCSI/iSCSI), NFS/SMB, HDFS, and S3/Swift.
• Service layer: Block (LUN, volume, DirectIO, L1 cache), File (L1 cache, MDS), HDFS (NameNode, LS), and Object (OSC, billing).
• Disaster recovery: HyperReplication and HyperMetro.
• Index layer: write-ahead log, snapshot, compression, deduplication, garbage collection, and log.
• Persistence layer: mirroring, erasure coding, fast reconstruction, write-back cache, and SmartCache.
• O&M plane: cluster management, DeviceManager, QoS, license/alarm/authentication, and eService.
• Hardware: x86 and Kunpeng.

Architecture advantages:
• Convergence of block, object, NAS, and HDFS services, implementing data flow and service interworking.
• Introduces the strengths of professional storage to achieve an optimal balance of performance, reliability, usability, cost, and scalability.
• Convergence of software and hardware, optimizing performance and reliability based on customized hardware.
6 Huawei Confidential
OceanStor 100D (Block + File + Object + HDFS): Overview · Ultimate Performance · Ultimate Usability · Ultimate Efficiency · Ultimate Reliability · Ultimate Ecosystem · Scenario and Ecosystem · Hardware

7 Huawei Confidential
Data Routing, Twice Data Dispersion, and Load Balancing Among Storage Nodes

Data path: the front-end module slices user data and disperses the slices across nodes over a DHT loop (first dispersion); on each node, the data processing module runs in vnodes; the data storage module then disperses the data again across Plogs whose partitions span disks (SSD/HDD) on multiple nodes (second dispersion).

Basic concepts:
• Node: a physical node, that is, a storage server.
• Vnode: a logical processing unit. Each physical node is divided into four logical processing units. When a physical node becomes faulty, the services processed by its four vnodes can be taken over by four other physical nodes in the cluster, improving takeover efficiency and load balancing.
• Partition: a fixed number of partitions are created in the storage resource pool. Partitions are also the units for capacity expansion, data migration, and data reconstruction.
• Plog: a partition log for data storage, providing an append-only read/write interface. The size of a Plog is not fixed; it can be 4 MB or 32 MB, up to a maximum of 4 GB. The Plog size, redundancy policy, and partition where data is stored are specified during service creation. When a Plog is created, a partition is selected based on load balancing and capacity balancing.
8 Huawei Confidential
I/O Stack Processing Framework

The I/O path is divided into two phases:
• Host I/O processing (1): after receiving data, the storage system stores one copy in the RAM cache and three copies in the SSD WAL caches of three storage nodes. Once the copies are successfully stored, a success response is returned to the upper-layer application host, and host I/O processing is complete.
• Background I/O processing (2): when the data in the RAM cache reaches a certain amount, the system computes EC over large data blocks and stores the generated data and parity fragments onto HDDs. (3: before data is stored onto HDDs, the system decides, based on data block size, whether to send the blocks to the SSD cache.)

9 Huawei Confidential
Storage Resource Pool Principles and Elastic EC

• EC (erasure coding) is a data protection mechanism that implements redundancy by calculating parity fragments. Compared with the multi-copy mechanism, EC provides higher storage utilization and significantly reduces costs.
• EC redundancy levels: N+M up to 24 when M = 2 or 4, and up to 23 when M = 3.
• Example: with EC 4+2 over three nodes, each partition table row lists six members (node, disk), for instance Partition1 = node1/Disk1, node1/Disk3, node2/Disk7, node2/Disk4, node3/Disk6, node3/Disk4.

Basic principles:
1. When a storage pool is created from disks on multiple nodes, a partition table is generated. The number of rows is 51200 x 3/(N+M) and the number of columns is (N+M). All disks are filled into the partition table as elements according to the reliability and partition balancing principles.
2. The mapping between partitions and disks is many-to-many: a disk may belong to multiple partitions, and a partition contains multiple disks.
3. Partition balancing principle: each disk appears in the partition table the same number of times (matching the number of times it appears in memory).
4. Reliability balancing principle: for node-level security, the number of disks from the same node in a partition cannot exceed M in EC N+M.
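The two balancing rules can be demonstrated with a small greedy builder. This sketch is illustrative only (the real table has 51200 x 3/(N+M) rows and a more sophisticated placement algorithm; the greedy selection here is an assumption):

```python
# Build partition rows of N+M disks such that (a) no node contributes more
# than M disks to a row and (b) disks are used a roughly equal number of times.

import itertools
from collections import Counter

def build_partition_table(disks, n, m, rows):
    """disks: list of (node_id, disk_id). Returns `rows` rows of N+M members."""
    width = n + m
    usage = Counter()
    table = []
    for _ in range(rows):
        row, per_node = [], Counter()
        # Greedily take the least-used disks that satisfy the node constraint.
        for node, disk in sorted(disks, key=lambda d: usage[d]):
            if per_node[node] < m:                 # reliability rule
                row.append((node, disk))
                per_node[node] += 1
                if len(row) == width:
                    break
        if len(row) < width:
            raise RuntimeError("not enough nodes/disks for EC %d+%d" % (n, m))
        for member in row:
            usage[member] += 1                     # balancing rule
        table.append(row)
    return table

disks = [(f"node{i}", f"disk{j}") for i, j in itertools.product(range(3), range(4))]
table = build_partition_table(disks, n=4, m=2, rows=6)
```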

10 Huawei Confidential
EC Expansion Balances Costs and Reliability

As nodes are added during capacity expansion, the number of data blocks in the EC scheme automatically increases while the number of parity blocks stays the same, so reliability is unchanged and space utilization improves (for example, EC 2+2 can grow step by step toward EC 22+2).

Redundancy ratio vs. space utilization:
EC (N+2): 2+2 50.00%, 4+2 66.66%, 6+2 75.00%, 8+2 80.00%, 10+2 83.33%, 12+2 85.71%, 14+2 87.50%, 16+2 88.88%, 18+2 90.00%, 20+2 90.91%, 22+2 91.67%
EC (N+3): 6+3 66.66%, 8+3 72.72%, 10+3 76.92%, 12+3 80.00%, 14+3 82.35%, 16+3 84.21%, 18+3 85.71%, 20+3 86.90%
EC (N+4): 6+4 60.00%, 8+4 66.66%, 10+4 71.43%, 12+4 75.00%, 14+4 77.78%, 16+4 80.00%, 18+4 81.82%, 20+4 83.33%

When adding nodes to the storage system, the customer can decide whether to expand the EC scheme. EC expansion increases the number of data blocks and improves space utilization while keeping the number of parity blocks, and therefore the reliability, unchanged.
11 Huawei Confidential
EC Reduction and Read Degradation Optimize Reliability and Performance

Data read in EC 4+2 (read degradation):
1. When a member fails, the EC layer reconstructs valid data through degraded read and verification decoding of the surviving original and parity fragments before returning it.
2. The reliability of data already stored in EC mode decreases while a node is faulty; it is restored through data reconstruction.

Data write in EC 4+2 (EC reduction):
1. If a fault occurs, writing data with EC reduction preserves write reliability: assuming the original EC scheme is 4+2, new data is written in EC 2+2 mode during the fault.
2. Write operations are therefore never degraded, providing higher reliability.

12 Huawei Confidential
High-Ratio EC Aggregates Small Data Blocks and ROW Balances Costs and Performance

Small write I/Os (4 KB/8 KB) from multiple LUNs are aggregated into a linear space and written as full EC stripes to an append-only Plog (ROW-based append-write technology for I/O processing).

Intelligent stripe aggregation algorithm + log appending reduces latency and enables a high EC ratio such as 22+2:
• Host write I/Os are aggregated into full EC stripes on top of the write-ahead log (WAL) mechanism, preserving system reliability.
• The load-based intelligent EC algorithm writes data to SSDs in full stripes, reducing write amplification and keeping host write latency under 500 μs.
• A mirror relationship is maintained between the data in the hash memory table and the SSD media logs. After aggregation, random writes become 100% sequential writes to the back-end media, improving random write efficiency.
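A hedged sketch of the full-stripe aggregation idea (illustrative only: extent metadata handling is omitted, the single-XOR parity is a placeholder for real EC P+Q, and the class and callback names are invented):

```python
# Small host writes are WAL-logged, buffered, and flushed only when a full
# EC stripe (22 data chunks here) is available, so the back end sees large
# sequential, append-only full-stripe writes.

CHUNK = 8 * 1024      # aggregation granularity (assumed)
DATA_COLS = 22        # EC 22+2 as on the slide

class StripeAggregator:
    def __init__(self, wal_append, plog_append):
        self.buf = []                  # aggregated (lun, lba, data) extents
        self.wal_append = wal_append   # persist first, then ack the host
        self.plog_append = plog_append

    def write(self, lun, lba, data):
        self.wal_append((lun, lba, data))
        self.buf.append((lun, lba, data))
        if sum(len(d) for _, _, d in self.buf) >= DATA_COLS * CHUNK:
            self.flush()

    def flush(self):
        payload = b"".join(d for _, _, d in self.buf)
        stripe, rest = payload[:DATA_COLS * CHUNK], payload[DATA_COLS * CHUNK:]
        chunks = [stripe[i * CHUNK:(i + 1) * CHUNK] for i in range(DATA_COLS)]
        parity = self._xor(chunks)          # placeholder; real EC computes P+Q
        self.plog_append(chunks, parity)    # one append-only full stripe
        self.buf = [(None, None, rest)] if rest else []   # carry the remainder

    @staticmethod
    def _xor(chunks):
        out = bytearray(CHUNK)
        for ch in chunks:
            for i, b in enumerate(ch):
                out[i] ^= b
        return bytes(out)

agg = StripeAggregator(wal_append=lambda entry: None, plog_append=lambda c, p: None)
for i in range(100):
    agg.write(lun=0, lba=i * 16, data=b"\xab" * CHUNK)
```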
13 Huawei Confidential
Inline and Post-Process Adaptive Deduplication and Compression

• OceanStor 100D supports global deduplication and compression as well as adaptive inline and post-process deduplication. Deduplication reduces write amplification before data is written to disks. Global adaptive deduplication and compression can be performed on all-flash SSDs and HDDs.
• OceanStor 100D uses an opportunity table and a fingerprint table. After data enters the cache, it is broken into 8 KB fragments, and SHA-1 fingerprints are calculated for each fragment. The opportunity table reduces invalid fingerprint space: data with low deduplication ratios is filtered out through the opportunity table. If inline deduplication fails or is skipped, post-process deduplication is enabled and the data block fingerprints enter the opportunity table; repeatedly seen entries are promoted from the opportunity table to the fingerprint table, and data matching the fingerprint table is deduplicated directly (inline) or after post-process deduplication.
• Adaptive inline and post-process deduplication: when system resource usage reaches the threshold, inline deduplication automatically stops and data is written directly to disks for persistent storage; when system resources are idle, post-process deduplication starts.
• Before data is written to disks, it enters the compression process, aligned in units of 512 bytes. The LZ4 algorithm is used, and deep compression (HZ9) is supported.
• The fingerprint table occupies little memory, which supports deduplication in large-capacity systems.

Impact on performance:
• All-flash SSDs, bandwidth-intensive services, deduplication and compression enabled: reduced by 30%.
• All-flash SSDs, IOPS-intensive services, deduplication and compression enabled: reduced by 15%.
• All-flash SSDs, bandwidth-intensive services, compression only: pure write increased by 50% and pure read increased by 70%.
• All-flash SSDs, IOPS-intensive services, compression only: reduced by 10%.
• HDDs, bandwidth-intensive services, compression only: pure write increased by 50% and pure read increased by 70%.
• HDDs, IOPS-intensive services, compression only: no impact.
14 Huawei Confidential
Post-Process Deduplication

Post-process deduplication runs between the deduplication service and the data service across nodes (e.g., Node 0 and Node 1) in seven steps:
1. Preparation
2. Injection of fingerprints into the opportunity table
3. Analysis
4. Raw data read
5. Promotion of repeated opportunity-table entries into the fingerprint data table
6. Remapping of the address mapping table to reference the fingerprint data
7. Garbage collection
15 Huawei Confidential
Inline Deduplication Process

1. DSC cache destage: data is destaged from the cache into the EDS process.
2. Fingerprint lookup for inline deduplication: the FPCache/fingerprint data table is queried (consulting the opportunity table and address mapping table as needed) and the result is returned.
3. If the fingerprint already exists, only the fingerprint reference is written instead of the data.

16 Huawei Confidential
Compression Process

• After deduplication, data is compressed; after compression, the data is compacted. The process is: 1) break data into fixed-length blocks, 2) compress, 3) compact.
• Data compression can be enabled or disabled as required at LUN/FS granularity.
• Supported compression algorithms: LZ4 and HZ9. HZ9 is a deep-compression algorithm whose compression ratio is about 20% higher than LZ4's.
• The length of compressed data varies. In a typical storage layout, each compressed block is padded to 512-byte alignment, wasting a lot of space. In the OceanStor 100D layout, multiple compressed blocks (including blocks smaller than 512 bytes) are compacted together, each preceded by a compression header that describes the start and length of the compressed data, and the packed result is stored 512-byte aligned. If the compacted data is not 512-byte aligned, zeros are appended. This improves space utilization.
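A hedged sketch of compress-then-compact (zlib stands in for LZ4/HZ9, which are not in the Python standard library; the header format and directory structure are invented for illustration): compressed blocks of varying length are packed back to back with a small header, and only the whole run is padded to a 512-byte boundary instead of padding every block individually.

```python
import struct
import zlib

ALIGN = 512

def compact(blocks):
    """blocks: list of raw blocks -> (packed bytes, directory of offsets)."""
    out, directory, offset = bytearray(), [], 0
    for i, blk in enumerate(blocks):
        comp = zlib.compress(blk)
        header = struct.pack("<II", i, len(comp))   # block id + compressed length
        out += header + comp
        directory.append((i, offset, len(comp)))
        offset += len(header) + len(comp)
    pad = (-len(out)) % ALIGN                       # zero-fill to 512 B alignment
    out += b"\x00" * pad
    return bytes(out), directory

packed, directory = compact([bytes(8192), b"abc" * 2000, bytes(range(256)) * 32])
assert len(packed) % ALIGN == 0
```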

17 Huawei Confidential
Ultra-Large Cluster Management and Scaling

• Compute cluster: compute nodes (running VBS) can be added or removed. Up to 10,240 compute nodes are supported over TCP, 100 over IB, and 768 over RoCE.
• Control cluster: ZK disks can be added or removed; up to nine ZooKeeper nodes are supported.
• Storage pool: storage nodes support node- and cabinet-level capacity expansion; nodes can be added to or removed from the original storage pool, or a new storage pool can be created for new nodes; disks can be added on demand on storage nodes.
18 Huawei Confidential
All-Scenario Online Non-Disruptive Upgrade (No Impact on Hosts)

Block mounting scenario:
• The VSC on the compute node only provides interfaces and forwards I/Os; its code is simplified and does not need to be upgraded.
• Each node completes the VBS upgrade within 5 seconds and services are quickly taken over. The host connection is uninterrupted, but I/Os are suspended for up to 5 seconds.
• Component-based upgrade: components without changes (e.g., EDS, OSD) are not upgraded, minimizing the upgrade duration and the impact on services.

iSCSI scenario:
• Before the upgrade, a backup process is started and maintains the connection with the iSCSI initiator. The TCP connection is backed up, the iSCSI connection state and in-flight I/Os are saved to shared memory, and after the upgrade the TCP and iSCSI connections and I/Os are restored; each node completes the upgrade within 5 seconds.
• Single-path upgrade is supported. Host connections are not interrupted, but I/Os are suspended for up to 5 seconds.
19 Huawei Confidential
Concurrent Upgrade of Massive Storage Nodes

Applications (NAS private client, HDFS, Obj S3 client, and Block VBS) access storage pools that are divided into disk pools of up to 32 storage nodes each.

Upgrade description:
• A storage pool is divided into multiple disk pools, and the disk pools are upgraded concurrently, greatly shortening the upgrade duration for massive numbers of distributed storage nodes.
• The customer can periodically query and update versions from the FSM node based on the configured policy.
• Upgrading compute nodes from a management node is supported; compute nodes can be upgraded by running upgrade commands.
20 Huawei Confidential
Coexistence of Multi-Generation and Multi-Platform
Storage Nodes
• Multi-generation storage nodes (dedicated storage nodes of three consecutive generations and general storage nodes of two
consecutive generations) can exist in the same cluster but in different pools.
• Storage nodes on different platforms can exist in the same cluster and pool.

[Diagram: coexistence of multi-generation storage nodes (Pacific V1, V2, V3) and multi-platform storage nodes (P100, C100) in the same cluster, connected by the federation cluster network (data).]


21 Huawei Confidential
OceanStor 100D Block+File+Object+HDFS
Overview | Ultimate Performance | Ultimate Reliability | Ultimate Efficiency | Ultimate Usability | Hardware | Scenario and Ecosystem

22 Huawei Confidential
CPU Multi-Core Intelligent Scheduling
Traditional CPU thread scheduling:
[Diagram: EDS and OSD threads (front-end, back-end, metadata, merge, Xnet, replication, deduplication) are scheduled arbitrarily across cores 0-5 of the CPU.]
• CPU grouping is unavailable, and frequent switchover of processes among CPU cores increases the latency.
• CPU scheduling by thread is disordered, decreasing CPU efficiency and peak performance.
• Tasks with different priorities, such as foreground and background tasks or interrupt and service tasks, conflict with each other, affecting performance stability.
Group- and priority-based multi-core CPU intelligent scheduling on the high-performance Kunpeng multi-core CPU:
[Diagram: cores are divided into groups 1 to N; the EDS front-end thread pool and OSD communication thread pools run at priority 1, the back-end and metadata merge thread pools at priority 2, and the deduplication (priority 4) and replication (priority 5) thread pools run as background tasks with adaptive speed adjustment, coordinated by the I/O scheduler.]
affecting performance stability.
Advantages of group- and priority-based multi-core CPU intelligent
scheduling:
• Elastic compute power balancing efficiently adapts to complex and diversified service
scenarios.
• Thread pool isolation and intelligent CPU grouping reduce switchover overhead and
provide stable latency.
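As a hedged illustration of group-based scheduling (not the product scheduler), the sketch below pins worker threads of different pools to disjoint core groups so foreground I/O threads never contend with background tasks; the pool names and core numbers are assumptions, and a Linux host with at least 8 cores is assumed.

    # Minimal sketch (assumed, not the product code): pin worker pools to disjoint
    # core groups. Linux-only: os.sched_setaffinity is used for CPU pinning, and the
    # example assumes at least 8 cores are present.
    import os
    import threading

    CORE_GROUPS = {
        "frontend_io": {0, 1, 2, 3},     # priority 1: front-end / communication pool
        "metadata_merge": {4, 5},        # priority 2
        "background_dedup": {6, 7},      # priority 4/5: dedup / replication pool
    }

    def worker(group, name):
        os.sched_setaffinity(0, CORE_GROUPS[group])   # pid 0 = the calling thread
        # ... a real worker would poll its queue here ...
        print(f"{name} running on cores {sorted(os.sched_getaffinity(0))}")

    if __name__ == "__main__":
        threads = [threading.Thread(target=worker, args=(g, f"{g}-{i}"))
                   for g in CORE_GROUPS for i in range(2)]
        for t in threads: t.start()
        for t in threads: t.join()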
23 Huawei Confidential
Distributed Multi-Level Cache Mechanism
Multi-level distributed read cache (read path):
• ~1 µs: semantic-level RAM cache (EDS metadata cache plus an I/O-model intelligent recognition engine); a hit returns immediately.
• ~10 µs: SCM read cache (RoCache, planned for the future); on a miss, data is read from the pool.
• ~100 µs: semantic-level SSD smart cache and disk-level SSD read cache.
• ~10 ms: HDDs; on a miss in all caches, data is read from the SSD or HDD pool.
Multi-level distributed write cache (write path):
• The WAL log and data are written to the BBU-protected RAM write cache and the host is answered quickly.
• Data is aggregated and flushed to the SSD write cache, and later aggregated and flushed to HDDs.
Various multi-level cache mechanisms and hotspot identification algorithms greatly improve performance and reduce latency on hybrid media.
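The read-path idea can be sketched as a tiered lookup (a minimal illustration with assumed tier names and capacities, not the product cache): each miss falls to the next, slower tier, and a hit in a lower tier promotes the data into the faster tiers.

    # Illustrative sketch only: RAM -> SCM -> SSD roughly mirror the 1 µs / 10 µs /
    # 100 µs tiers on this slide; capacities and eviction policy are assumptions.
    from collections import OrderedDict

    class Tier:
        def __init__(self, name, capacity):
            self.name, self.capacity = name, capacity
            self.data = OrderedDict()                 # LRU order
        def get(self, key):
            if key in self.data:
                self.data.move_to_end(key)            # refresh recency
                return self.data[key]
            return None
        def put(self, key, value):
            self.data[key] = value
            self.data.move_to_end(key)
            if len(self.data) > self.capacity:
                self.data.popitem(last=False)         # evict LRU entry

    class MultiLevelReadCache:
        def __init__(self, backend):
            self.tiers = [Tier("RAM", 4), Tier("SCM", 16), Tier("SSD", 64)]
            self.backend = backend                    # persistence layer (pool)
        def read(self, key):
            for i, tier in enumerate(self.tiers):
                value = tier.get(key)
                if value is not None:
                    for faster in self.tiers[:i]:     # promote hot data upwards
                        faster.put(key, value)
                    return value, tier.name
            value = self.backend[key]                 # miss everywhere: read the pool
            for tier in self.tiers:
                tier.put(key, value)
            return value, "POOL"

    if __name__ == "__main__":
        cache = MultiLevelReadCache(backend={n: f"block-{n}" for n in range(256)})
        print(cache.read(7))      # ('block-7', 'POOL')
        print(cache.read(7))      # ('block-7', 'RAM')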

24 Huawei Confidential
EC Intelligent Aggregation Algorithm
Traditional cross-node EC:
• Writes are performed in place and small-block I/O aggregation is unavailable.
• Fixed address mapping cannot wait until all data of a stripe has been written, so partial stripes require supplementary reads/writes of the missing blocks (for example A2/A4 or B2/B3/B4), and read/write amplification reaches 2 to 3 times that of a full stripe.
Intelligent aggregation EC based on append write:
• Append-only writes plus intelligent cache aggregation: data from any LUN written at any time can be aggregated into full stripes, irrespective of the write address, without extra amplification.
• Performance is improved by N times.
[Diagram: blocks A1-A8 of LUN 1 and B1-B9 of LUN 2 are either written in place as partial stripes with parity P/Q across nodes 1-6 (traditional), or appended and aggregated into new full stripes such as A1 B1 A3 B6 + P Q and B9 A5 B5 A8 + P Q (append write).]

Performance advantages of intelligent aggregation EC based on append write:


• Full-stripe EC write can be ensured at any time, reducing read/write network amplification
and read/write disk amplification by several times.
• Data is aggregated at a time, reducing the CPU compute overhead and providing ultimate
peak performance.
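A minimal sketch of append-based stripe aggregation, assuming a 4+2 layout and 8 KB grains (XOR is used twice as a stand-in for the real Reed-Solomon parity): writes from any LUN are appended to a buffer and sealed into full stripes, so no partial-stripe read-modify-write is ever needed.

    # Sketch under assumptions (4+2 EC, 8 KB grains), not the product algorithm.
    GRAIN = 8 * 1024
    K, M = 4, 2                                    # 4 data + 2 parity

    def xor_parity(grains):
        p = bytearray(GRAIN)
        for g in grains:
            for i, b in enumerate(g):
                p[i] ^= b
        return bytes(p)

    class StripeAggregator:
        def __init__(self):
            self.buffer, self.stripes = [], []
        def write(self, lun, lba, data):
            assert len(data) == GRAIN
            self.buffer.append((lun, lba, data))   # append-only, address-agnostic
            if len(self.buffer) == K:
                grains = [d for _, _, d in self.buffer]
                # Real EC uses Reed-Solomon for the second parity; XOR twice is a stand-in.
                parity = [xor_parity(grains), xor_parity(grains)]
                self.stripes.append((self.buffer, parity))
                self.buffer = []

    if __name__ == "__main__":
        agg = StripeAggregator()
        for n in range(8):                         # writes from two LUNs interleaved
            agg.write(lun=n % 2, lba=n * GRAIN, data=bytes([n]) * GRAIN)
        print(len(agg.stripes), "full stripes sealed, no supplementary reads")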
25 Huawei Confidential
Append Only Plog
Disks and new media have great performance differences in different I/O patterns.
Random Write (8 KB):
  Disk Type | Performance            | GC Impact
  HDD       | 150 IOPS / 1.2 MB/s    | /
  SSD       | 40K IOPS / 312 MB/s    | Bad
  SCM       | 200K IOPS / 1562 MB/s  | /
Sequential Write (8 KB aggregated into large blocks):
  Disk Type | Performance            | GC Impact
  HDD       | 5120 IOPS / 40 MB/s    | /
  SSD       | 153K IOPS / 1200 MB/s  | Good
  SCM       | 640K IOPS / 5000 MB/s  | /

The Append Only Plog technology provides the optimal disk flushing performance model for media.
• Plog is a set of physical addresses that are managed based on a fixed size. The
upper layer accesses the Plog by using Plog ID and offset.
• Plog is append-only and cannot be overwritten.
[Diagram: overwrites of A and B in the logical address space become new versions A' and B'; the cache appends A B C D A' E F B' linearly and writes them to new Plogs (Plog 1, Plog 2, Plog 3, ...) in the physical address space, addressed by Plog ID + offset.]
Plog has the following performance advantages:
• Provides a medium-friendly large-block sequential write model with optimal performance.
• Reduces SSD global garbage collection pressure through sequential appending.
• Provides a basis for implementing the EC intelligent aggregation algorithm.

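A hedged sketch of the Plog addressing model (sizes and names are assumptions): overwrites never touch the old location; they append to the active Plog and update the logical-to-(Plog ID, offset) mapping, leaving stale extents for garbage collection.

    # Minimal sketch, not the product implementation.
    PLOG_SIZE = 4 * 1024 * 1024            # fixed-size Plog, e.g. 4 MB (assumed)

    class PlogStore:
        def __init__(self):
            self.plogs = {1: bytearray()}   # plog_id -> appended bytes
            self.active = 1
            self.index = {}                 # logical address -> (plog_id, offset, length)
        def write(self, logical_addr, data):
            plog = self.plogs[self.active]
            if len(plog) + len(data) > PLOG_SIZE:        # seal full Plog, open a new one
                self.active += 1
                self.plogs[self.active] = plog = bytearray()
            offset = len(plog)
            plog += data                                  # append only, never overwrite
            self.index[logical_addr] = (self.active, offset, len(data))
        def read(self, logical_addr):
            plog_id, offset, length = self.index[logical_addr]
            return bytes(self.plogs[plog_id][offset:offset + length])

    if __name__ == "__main__":
        store = PlogStore()
        store.write(0x1000, b"A" * 8192)
        store.write(0x1000, b"B" * 8192)    # overwrite lands at a new (plog_id, offset)
        print(store.index[0x1000], store.read(0x1000)[:2])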
26 Huawei Confidential
RoCE+AI Fabric Network Provides Ultra-Low Latency

[Diagram: compute nodes access storage nodes (front-end NICs, storage controllers, back-end NICs, SSDs/HDDs) over a 25 Gbit/s TCP/IP or 100 Gbit/s RoCE front-end service network; the back-end storage network is 100 Gbit/s RoCE + AI Fabric.]
Front-end service network
• Block/NAS private client: RoCE
• Standard object protocol: S3
• Standard HDFS protocol: HDFS
• Standard block protocol: iSCSI
• Standard file protocol: NFS/SMB
Back-end storage network
• Block/NAS/Object/HDFS: RoCE
• AI fabric intelligent flow control: precisely locates congested flows and applies backpressure suppression without affecting normal flows. The waterline is set dynamically to deliver the highest performance without packet loss, and the switch and NIC collaborate to schedule the maximum quota for congestion prevention. Network latency fluctuation is kept within 5% without packet loss, and latency is kept within 30 µs.
• Link binding doubles the network bandwidth.

27 Huawei Confidential
Block: DHT Algorithm + EC Aggregation Ensure Balancing and
Ultimate Performance
Service layer (granularity: 64 MB)
• I/O data is distributed across storage nodes (Node-1 to Node-7) in 64 MB blocks using the DHT algorithm: (LBA / 64 MB) % number of partitions.
• After receiving the data, the storage system divides it into fixed-length 8 KB data blocks (grains), deduplicates and compresses the data at 8 KB granularity, and aggregates it.
• The aggregated data is stored to the storage pool.
Index layer
• Maintains the mapping between LUN LBAs and grains (for example LUN1-LBA1 to Grain1, LUN1-LBA2 to Grain2, LUN2-LBA4 to Grain4) and the mapping between grains and partition IDs.
Persistency layer
• Four grains form an EC stripe (D1-D4 plus P1, P2) and are stored in the partition across nodes Node-1 to Node-7.
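The routing rule on this slide can be illustrated with a short sketch (the modulo placement is a simplified stand-in for the DHT, and grain alignment is assumed): an LBA is mapped to its 64 MB extent owner and the payload is cut into fixed 8 KB grains.

    # Hedged sketch of the placement rule described above, not the real DHT.
    EXTENT = 64 * 1024 * 1024
    GRAIN = 8 * 1024
    NODES = 7

    def route_extent(lba):
        return (lba // EXTENT) % NODES          # owning storage node for this extent

    def split_into_grains(lba, data):
        # Assumes lba and len(data) are grain-aligned, as the aggregation layer would ensure.
        return [(lba + i, data[i:i + GRAIN]) for i in range(0, len(data), GRAIN)]

    if __name__ == "__main__":
        lba = 3 * EXTENT + 5 * GRAIN
        print("extent owner:", route_extent(lba))
        grains = split_into_grains(lba, bytes(4 * GRAIN))
        print(len(grains), "grains of", GRAIN, "bytes each")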
28 Huawei Confidential
Object: Range Partitioning and WAL Submission Improve Ultimate
Performance and Process Hundreds of Billions of Objects
Range partitioning
[Diagram: object keys A, AA, AB, ABA, ABB, ..., ZZZ are split into Range 0 to Range n, hosted on Node 1 to Node n.]
1. Metadata is evenly distributed in lexicographic order using range partitioning, and metadata is cached on SSDs.
• The node holding a metadata index can be located from the bucket name and key prefix, enabling fast metadata search and traversal.
WAL submission
1. WAL mode is used: the foreground records only one write-log operation for one PUT operation.
2. Data is compacted in the background.
• The WAL mode reduces foreground write I/Os.
• SSDs improve the access speed.

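A minimal sketch of both ideas under stated assumptions (the split points, node names, and WAL path are illustrative): keys are routed to range partitions by lexicographic order, and each PUT appends exactly one record to a write-ahead log before the in-memory table is updated.

    import bisect, json

    SPLIT_POINTS = ["g", "n", "t"]          # 4 ranges: [..g), [g..n), [n..t), [t..]
    OWNERS = ["node-1", "node-2", "node-3", "node-4"]

    def owner_of(key):
        return OWNERS[bisect.bisect_right(SPLIT_POINTS, key)]

    class RangePartition:
        def __init__(self, wal_path):
            self.wal = open(wal_path, "a")  # on real hardware this sits on SSD
            self.memtable = {}
        def put(self, key, meta):
            self.wal.write(json.dumps({"k": key, "m": meta}) + "\n")
            self.wal.flush()                # a single log write per PUT
            self.memtable[key] = meta

    if __name__ == "__main__":
        print(owner_of("alpha"), owner_of("photo-001"), owner_of("zebra"))
        part = RangePartition("/tmp/range0.wal")
        part.put("alpha/object-1", {"size": 4096})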
29 Huawei Confidential
HDFS: Concurrent Multiple NameNodes Improve
Metadata Performance
[Diagram: in the traditional model, Hadoop compute nodes (HBase/Hive/Spark) access one active NameNode with standby NameNodes kept in sync through NFS- or Quorum-Journal-based HA; in Huawei HDFS, multiple active NameNodes and DataNodes serve metadata concurrently.]
Traditional HDFS NameNode model:
• Only one active NameNode provides the metadata service. The active and standby NameNodes are not consistent in real time and have a synchronization period.
• If the current active NameNode becomes faulty, the new NameNode cannot provide the metadata service until it completes log loading, which can take up to several hours.
• The number of files supported by a single active NameNode depends on the memory of a single node; the maximum is about 100 million files.
• When a namespace is under heavy pressure, concurrent metadata operations consume a large amount of CPU and memory, deteriorating performance.
Huawei HDFS concurrent multi-NameNode:
• Multiple active NameNodes provide metadata services, ensuring real-time data consistency among multiple nodes.
• Avoids the metadata service interruption caused by traditional HDFS NameNode switchover.
• The number of files supported by multiple active NameNodes is no longer limited by the memory of a single node.
• Multi-directory metadata operations are performed concurrently on multiple nodes.

30 Huawei Confidential
NAS: E2E Performance Optimization
Private client, distributed cache, and large I/O passthrough (DIO) technologies enable a storage system to
provide high bandwidth and high OPS.
[Diagram: applications access the cluster through the private client or a standard protocol client over the front-end network; each node runs the protocol, read cache, space, write cache, index, and persistence layers, interconnected by the back-end network.]
Key Technology | Large Files, High Bandwidth | Large Files and Random Small I/Os | Massive Small Files, High OPS | Peer Vendors
1. Private client (POSIX, MPI-IO) | Multi-channel protocol: 35% higher bandwidth than common NAS protocols | Intelligent protocol load balancing: 45% higher IOPS than common NAS protocols | Small I/O aggregation: 40% higher IOPS than common NAS protocols | Isilon and NetApp do not support this function; DDN and IBM support it.
2. Intelligent prefetch | Sequential I/O pre-read improves bandwidth. | Regular stream prefetch improves the memory hit ratio and performance and reduces latency. | An I/O workload model built by intelligent service learning implements cross-file pre-reading and improves the hit ratio. | Peer vendors support only sequential stream pre-reading; they do not support interval stream or cross-file pre-reading.
3. NVDIMM | Stable write latency | Stable write latency and I/O aggregation | Stable write latency and file aggregation | Only Isilon and NetApp support this function.
4. Self-adaptive application block size | 1 MB block size, improving read and write performance | 8 KB block size, improving read and write performance | 8 KB block size, improving read and write performance | Only Huawei supports this function.
5. Large I/O passthrough | Large I/Os are directly read from and written to the persistence layer, improving bandwidth. | / | / | Only Huawei supports this function.

31 Huawei Confidential
OceanStor 100D Block+File+Object+HDFS
Overview | Ultimate Performance | Ultimate Reliability | Ultimate Efficiency | Ultimate Usability | Hardware | Scenario and Ecosystem

32 Huawei Confidential
Component-Level Reliability: Component Introduction and
Production Improve Hardware Reliability
Component introduction and qualification process
• Disk selection starts with a design review and preliminary tests on R&D samples, followed by a qualification test with 500+ test cases covering system functions and performance, compatibility with earlier versions, and disk reliability, together with authentication and analysis of product applications.
• ERT long-term reliability test: about 1,000 disks are tested for three months by introducing multiple acceleration factors.
• Failure analysis and circular improvements, supplier tests/audits during production (quality/supply/cooperation), an online quality data analysis and warning system, and RMA sampling tests.
• Joint test with 100+ test cases: functions, performance, compatibility, reliability, firmware bug tests, vibration, temperature cycle, disk head, disk, and special specifications verification.
• Production test (including temperature cycle) and ORT long-term test on product samples after delivery.
5-level FA (failure analysis)
I: System/disk logs - locate simple system problems quickly.
II: Protocol analysis - locate interaction problems accurately.
III: Electrical signal analysis - locate complex disk problems.
IV: Disk teardown analysis - locate physical damage to disks.
V: Reverse data restoration.

Based on advanced test algorithms and industry-leading full temperature cycle tests, firmware tests, ERT tests, and ORT tests,
Huawei distributed storage ensures that component defects and firmware bugs can be effectively intercepted and the overall
hardware failure rate is 30% lower than the industry level.

33 Huawei Confidential
Product-Level Reliability: Data Reconstruction Principles
[Diagram: when a disk fails, distributed RAID reconstructs the lost data in parallel across all disks of all nodes (Node 1 to Node 6).]
• Distributed RAID vs. traditional RAID: reconstruction takes about 30 minutes instead of 5 hours, roughly 10 times faster than traditional storage.
• Distributed RAID vs. other distributed storage: reconstruction takes about 30 minutes instead of 1 hour, about twice as fast.

34 Huawei Confidential
Product-Level Reliability: Technical Principles of E2E
Data Verification
[Diagram: applications access Block, NAS, S3, and HDFS services through a switch; on the write path (WP), ① the DIA is inserted at the VBS and ② verified at the OSD; on the read path (RP), ③ the DIA is verified at the VBS; ⑤→④ read repair from the remote HyperMetro replication site, ⑥→④ read repair from another copy or by EC check calculation. Each 4 KB data block carries a 64-byte data integrity area across its 512-byte sectors.]
 Two verification modes:
 Real-time verification: write requests are verified at the access point of the system (the VBS process). Host data is re-verified in the OSD process before being written to disks. Data read by the host is verified in the VBS process.
 Periodical background verification: when the service pressure of the system is low, the system automatically enables periodical background verification and self-healing.
 Three verification mechanisms: CRC32 protects users' 4 KB data blocks. In addition, OceanStor 100D supports host and disk LBA logical verification to cover silent data corruption scenarios such as data transposition, read offset, and write offset.
 Two self-healing mechanisms: the local redundancy mechanism and active-active redundant data.
 Block, NAS, and HDFS support E2E verification; Object does not support this function.
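A hedged sketch of the per-block integrity idea (the 12-byte layout is an assumption, not the real DIA format): a CRC32 of the 4 KB payload plus the expected LBA lets both bit corruption and misplaced (offset) reads be detected.

    import struct, zlib

    BLOCK = 4096

    def make_dia(lba, data):
        assert len(data) == BLOCK
        return struct.pack("<IQ", zlib.crc32(data), lba)     # 4-byte CRC + 8-byte LBA

    def verify(lba, data, dia):
        crc, tagged_lba = struct.unpack("<IQ", dia)
        if crc != zlib.crc32(data):
            raise ValueError("silent corruption: CRC mismatch")
        if tagged_lba != lba:
            raise ValueError(f"misplaced block: expected LBA {lba}, found {tagged_lba}")

    if __name__ == "__main__":
        blk = bytes(range(256)) * 16
        dia = make_dia(lba=42, data=blk)
        verify(42, blk, dia)                  # passes
        try:
            verify(43, blk, dia)              # detected as a read-offset error
        except ValueError as e:
            print(e)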
35 Huawei Confidential
Product-Level Reliability: Technical Principles of
Subhealth Management
1. Disk sub-health management
 Intelligent detection and diagnosis: information from SMART (Self-Monitoring, Analysis and Reporting Technology), statistical I/O latency, real-time I/O latency, and I/O errors is collected. Clustering and slow-disk detection algorithms are used to diagnose abnormal disks or RAID controller cards.
 Isolation and warning: after diagnosis, the MDC is instructed to isolate the involved disks and an alarm is reported.
2. Network sub-health management
 Multi-level detection: the local network of a node quickly detects exceptions such as intermittent disconnections, packet errors, and abnormal negotiated rates. In addition, nodes are intelligently selected to send detection packets adaptively to identify link latency exceptions and packet loss.
 Smart diagnosis: smart diagnosis is performed on network ports, NICs, and links based on networking models and error messages.
 Level-by-level isolation and warning: network ports, links, and nodes are isolated based on the diagnosis results and alarms are reported.
3. Process/Service sub-health management
 Cross-process/service detection ①: if the I/O access latency exceeds the specified threshold, an exception is reported.
 Smart diagnosis ②: OceanStor 100D diagnoses processes or services with abnormal latency using majority voting or a clustering algorithm based on the abnormal I/O latency reported by each process or service.
 Isolation and warning ③: abnormal processes or services are reported to the MDC for isolation and an alarm is reported.
4. Fast-Fail (fast retry with path switching)
 Ensures that the I/O latency contributed by a single sub-healthy node stays controllable.
 I/O-level latency detection ①: checks whether the response time of each I/O exceeds the specified threshold and whether a response is returned at all. If no response is returned, Fast-Fail is started.
 Path switchover retry ②: for read I/Os, other copies are read or EC recalculation is used; for write I/Os, new space is allocated on other normal disks and the data is written there.
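As a hedged illustration of slow-disk detection (not the shipped algorithm), the sketch below flags a disk whose recent latency is a robust outlier against the population median, which is the kind of signal that would be reported to the MDC for isolation.

    # Simple median/MAD outlier check; thresholds and inputs are assumptions.
    from statistics import median

    def slow_disks(latency_ms, k=5.0):
        """latency_ms: {disk_id: recent average latency in ms}; returns suspect disks."""
        values = list(latency_ms.values())
        med = median(values)
        mad = median(abs(v - med) for v in values) or 1e-6   # avoid zero spread
        return [d for d, v in latency_ms.items() if (v - med) / mad > k]

    if __name__ == "__main__":
        sample = {"disk-0": 0.8, "disk-1": 0.9, "disk-2": 0.85, "disk-3": 7.5, "disk-4": 0.95}
        print(slow_disks(sample))    # ['disk-3'] would be reported for isolation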
36 Huawei Confidential
Product-Level Reliability: Proactive Node Fault Notification
and I/O Suspension for 5 Seconds
EulerOS works with the 1822 NIC. When a node restarts due to a power failure, a power-off by pressing the power button, a hardware fault, or a system reset, the 1822 iNIC proactively sends a node fault message so that other nodes can quickly rectify the node fault. The I/O suspension time is less than or equal to 5 seconds.

[Diagram: on a reboot, an unexpected power-off, a power-off by pressing the power button, or a system-insensitive reset caused by hardware faults, key information is written in advance and the 1822 NIC broadcasts a node fault notification; the ZK, CM, and MDC processes on other nodes handle node entry and troubleshooting.]

37 Huawei Confidential
Solution-Level Reliability: Gateway-Free Active-Active Design
Achieves 99.9999% Reliability (Block)

[Diagram: ERP, CRM, and BI applications access two OceanStor 100D clusters (A and B) over SCSI/iSCSI; synchronous data dual-write runs between the sites across a distance of up to 100 km.]
 Two OceanStor 100D clusters can be used to construct active-active access capabilities. If either data center is faulty, the system automatically switches to the other cluster, so no data is lost (RPO = 0, RTO ≈ 0) and upper-layer services are not interrupted.
 OceanStor 100D HyperMetro is based on a virtual volume mechanism. The two storage clusters are virtualized into one cross-site virtual volume whose data is synchronized between the two clusters in real time, and both clusters can process I/O requests from the application servers at the same time.
 HyperMetro scales well. For customers with large-scale applications, each storage cluster can be configured with multiple nodes, and each node shares the data synchronization load, supporting subsequent service growth.
 HyperMetro supports third-place arbitration mode and static priority arbitration mode. If the third-place quorum server is faulty, the system automatically switches to static priority arbitration to ensure service continuity.

38 Huawei Confidential
Key Technologies of Active-Active Consistency
Assurance Design
[Diagram: the host cluster spanning data centers A and B issues I/O1 and I/O2 to a cross-site active-active HyperMetro LUN; I/O1 is dual-written and takes a local range lock at both ends, a conflict with I/O2 is detected on the range lock, and I/O2 is forwarded and processed serially after I/O1 on the HyperMetro member LUNs of OceanStor 100D A and B.]
 In normal cases, write I/Os are written to both sites concurrently before a response is returned to the hosts, ensuring data consistency.
 An optimistic lock mechanism is used. When the LBA ranges of I/Os at the two sites do not conflict, the I/Os are written to their own locations. If the two sites write to overlapping LBA ranges of the same HyperMetro LUN, the conflicting I/O is forwarded to one site for serial write processing, which resolves the lock conflict and ensures data consistency between the two sites.
 If a site is faulty, the other site automatically takes over without interrupting services. After the faulty site recovers, incremental data synchronization is performed.
 SCSI, iSCSI, and consistency groups are supported.
Summary: with the active-active mechanism, data is kept consistent in real time, ensuring core service continuity.
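A minimal sketch of the optimistic range-lock idea (assumed semantics, not the HyperMetro implementation): non-overlapping LBA ranges are written locally in parallel, while an overlapping write fails its optimistic attempt and would be forwarded to the lock owner for serial processing.

    class RangeLockTable:
        def __init__(self):
            self.locks = []                       # list of (start_lba, end_lba, owner)
        def try_lock(self, start, end, owner):
            for s, e, o in self.locks:
                if start < e and s < end:         # ranges overlap: optimistic attempt fails
                    return False, o               # caller forwards the I/O to owner o
            self.locks.append((start, end, owner))
            return True, owner
        def unlock(self, start, end, owner):
            self.locks.remove((start, end, owner))

    if __name__ == "__main__":
        table = RangeLockTable()
        print(table.try_lock(0, 1024, "site-A"))       # (True, 'site-A')  -> local write
        print(table.try_lock(2048, 4096, "site-B"))    # (True, 'site-B')  -> parallel write
        print(table.try_lock(512, 1536, "site-B"))     # (False, 'site-A') -> forward, serialize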

39 Huawei Confidential
Cross-Site Bad Block Recovery

[Diagram: the production host reads a HyperMetro LUN whose member LUNs reside on OceanStor 100D A and B; a bad block detected on array A is repaired with data read from array B.]
① The production host reads data from storage array A.
② Storage array A detects a bad block through verification.
③ The bad block fails to be recovered by reconstruction. (If it is recovered, the following steps are not executed.)
④ Storage array A detects that the HyperMetro pair is in the normal state and initiates a request to read the data from storage array B.
⑤ The data is read successfully and returned to the production host.
⑥ The data from storage array B is used to recover the bad block.
The cross-site bad block recovery technology is Huawei's proprietary technology and is executed automatically.
40 Huawei Confidential
Solution-Level Reliability: Remote Disaster Recovery for Core
Services, Second-Level RPO (Block)
Asynchronous replication provides
99.999% solution-level reliability.
[Diagram: the production center replicates to the remote DR center over a WAN using asynchronous replication.]
 An asynchronous remote replication relationship is established between a primary LUN in the production center and a secondary LUN in the DR center. Initial synchronization then copies all data from the primary LUN to the secondary LUN.
 After the initial synchronization is complete, asynchronous remote replication enters the periodic incremental synchronization phase. Based on the customized interval, the system periodically starts a synchronization task and copies to the secondary LUN the incremental data written by the production host to the primary LUN since the end of the previous period.
 Supports an RPO in seconds and a minimum synchronization period of 10 seconds.
 Supports consistency groups.
 Supports one-to-one and one-to-many (up to 32) replication.
 Simple management with one-click DR drills and recovery.
 A single cluster supports 64,000 pairs, meeting large-scale cloud application requirements.

41 Huawei Confidential
Data Synchronization and Difference Recording Mechanisms

[Diagram: nodes 1 to 4 of replication cluster A each run Getdelta and replicate asynchronously to the corresponding nodes of replication cluster B.]
 Load balancing in replication: based on the preset synchronization period, the primary replication cluster periodically initiates a synchronization task and breaks it down to the working nodes according to the balancing policy. Each working node obtains the differential data generated between the specified points in time and synchronizes it to the secondary end.
 When the network between a replication node at the production site and a node at the remote site is abnormal, data can be forwarded through other nodes at the production site and then to the remote site.
 Asynchronous replication has little impact on system performance because no differential log is required. The LSM log (ROW) mechanism supports differences at multiple points in time, saving memory and reducing the impact on host services.

Summary: Second-level RPO and asynchronous replication without differential logs (ROW mechanism) are
supported, helping customers recover services more quickly and efficiently.
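A hedged sketch of why no differential log is needed with redirect-on-write (ROW) snapshots (simplified mappings, not the real metadata): the delta between two points in time is just the set of grains whose physical location changed.

    def snapshot(volume_mapping):
        return dict(volume_mapping)               # ROW snapshot: freeze the LBA->extent map

    def delta(prev_snap, curr_snap):
        # Grains whose mapping changed (or appeared) since the previous sync cycle.
        return {lba: ext for lba, ext in curr_snap.items() if prev_snap.get(lba) != ext}

    if __name__ == "__main__":
        mapping = {0: "plog1:0", 8192: "plog1:8192"}
        snap_t0 = snapshot(mapping)
        mapping[0] = "plog2:0"                    # host overwrite redirected to a new extent
        mapping[16384] = "plog2:8192"             # new write
        snap_t1 = snapshot(mapping)
        print(delta(snap_t0, snap_t1))            # only these grains are sent to the DR site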

42 Huawei Confidential
OceanStor 100D Block+File+Object+HDFS
Overview | Ultimate Performance | Ultimate Reliability | Ultimate Efficiency | Ultimate Usability | Hardware | Scenario and Ecosystem

43 Huawei Confidential
Out-of-the-Box Operation
Workflow: install the hardware, install the software, configure the temporary network, install OceanStor 100D, scan nodes and configure the network, create a storage pool, then install compute nodes in batches or manually.
 In-depth preinstallation is completed before delivery.
 Automatic node detection, one-click network configuration, and intelligent storage pool creation are supported, making services available within 2 minutes.
 A large number of compute nodes can be installed in one-click mode, manually, or by compiling scripts.

44 Huawei Confidential
AI-based Forecast
1. eService uses massive data to train AI algorithms online, and mature algorithms are adapted to devices.
2. AI algorithms that do not require large-data-volume training are self-learned in the device.

[Diagram: capacity and disk information collected on the device is fed into AI algorithms on the device; results are output locally, and mature algorithms trained online by eService are delivered to the device.]
• Capacity forecast: AI algorithms run on the device against historical capacity information to forecast capacity risks one year in advance.
• Risky disk forecast: AI algorithms run on the device against disk logs to forecast disk risks half a year in advance.

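As a hedged illustration only (a simple linear trend, not the shipped AI model), the sketch below fits used-capacity history and estimates when a pool would cross a capacity threshold.

    def fit_line(days, used_tb):
        n = len(days)
        mx, my = sum(days) / n, sum(used_tb) / n
        slope = sum((x - mx) * (y - my) for x, y in zip(days, used_tb)) / \
                sum((x - mx) ** 2 for x in days)
        return slope, my - slope * mx              # y = slope * day + intercept

    def days_until_full(days, used_tb, capacity_tb):
        slope, intercept = fit_line(days, used_tb)
        if slope <= 0:
            return None                            # not growing: no forecastable risk
        return (capacity_tb - intercept) / slope - days[-1]

    if __name__ == "__main__":
        history_days = list(range(0, 180, 30))             # samples over six months
        history_used = [40, 46, 53, 58, 66, 71]            # TB used (illustrative data)
        print(round(days_until_full(history_days, history_used, 120)), "days to 120 TB")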
45 Huawei Confidential
Hardware Visualization
Visualized hardware: hardware-based modeling and layered display of global hardware enable second-level hardware fault locating.

46 Huawei Confidential
Network Visualization
Visualized networks: based on the network information model, collaborative network management is achieved, supporting network planning, fault impact analysis, and fault locating.

• The network topology can be


displayed by VLAN and service
type, facilitating unified
network planning.
• Network visualization displays
network faults, subhealth
status, and performance,
associates physical network
ports, presents service
partitions, and helps customers
to identify the impact scope of
faults or subhealth for efficient
O&M.

47 Huawei Confidential
Three-Layer Intelligent O&M
Data center layer (connected to OceanStor 100D over REST, REST/CLI, and REST/SNMP/IPMI):
• OceanStor DJ (storage resource control): resource provisioning, service planning.
• SmartKit (storage service tool): delivery, upgrade, troubleshooting, log analysis, inspection tool.
• eSight (storage monitoring and management): fault monitoring, performance reports, storage subnet topology, correlation analysis and forecast, HDFS maintenance.
eService cloud (reached over https/email): intelligent maintenance platform and intelligent analysis platform providing remote monitoring, troubleshooting, asset management, capacity forecast, health evaluation, automatic work order creation, TAC/GTAC and R&D experts, massive maintenance data, and a global knowledge base.
Device layer (OceanStor 100D): DeviceManager single-device management software and CLI providing AAA management, license management, deployment, capacity expansion/reduction by servers, upgrade, capacity forecast, risky disk forecast, cluster management, single-device management, monitoring, configuration, inspection, and service software management.

48 Huawei Confidential
OceanStor 100D Block+File+Object+HDFS
Overview | Ultimate Performance | Ultimate Reliability | Ultimate Efficiency | Ultimate Usability | Hardware | Scenario and Ecosystem

49 Huawei Confidential
Flexible and General-Purpose Storage Nodes

• P/F100 2 U rack storage node: 2 x Kunpeng 920 CPUs; 12/25 disk slots or 24 NVMe slots; GE / 10GE / 25GE / 100 Gbit/s IB.
• C100 4 U rack storage node: 2 x Kunpeng 920 CPUs; 36 disk slots; GE / 10GE / 25GE / 100 Gbit/s IB.
Name | Node | Node Type | Hardware Platform
OceanStor 100D | C100 | Capacity | TaiShan 200 (Model 5280)
OceanStor 100D | P100 | Performance | TaiShan 200 (Model 2280)
OceanStor 100D | F100 | Flash | TaiShan 200 (Model 2280)

50 Huawei Confidential
New-Generation Distributed Storage Hardware
High-density HDD (Pacific):
• Front: 5 U, 2 nodes, 60 HDDs per node, 120 HDDs in total.
• Rear: two I/O cards per node, four onboard 25GE ports.
High-density all-flash (Atlantic):
• Front: 5 U, 8 nodes, 10 SSDs per node, 80 SSDs in total.
• Rear: two I/O cards per node (16 I/O cards in total); two switch boards supporting eight 100GE ports and eight 25GE ports.

51 Huawei Confidential
Pacific Architecture Concept: High Disk Density, Ultimate TCO, Dual-Controller
Switchover, High Reliability, Smooth Data Upgrade, and Ever-New Data
• Two nodes with 120 disks provide 24 disk slots per U, with industry-leading density and the lowest TCO per disk slot in the industry.
• The dual-controller switchover design of vnode eliminates data reconstruction in the case of controller failures, ensuring service continuity and
high reliability.
• FRU design for entire system components, independent smooth evolution for each subsystem, one entire system for ever-new data

PCIe card

PCIe card

PCIe card

PCIe card
25G

25G
GE

GE
1*

4*

4*

1*
Controller

FAN

BMC 1620 1620 BMC

SAS/PCIE SAS/PCIE

PWR
Exp Exp Exp Exp Exp Exp Exp Exp

Power supply unit

Power supply unit


SSD

System disk
System disk
Fan module

Fan module
HALFPALM
HALFPALM

HALFPALM

HALFPALM

HALFPALM
HALFPALM

HALFPALM

HALFPALM
BBU

BBU

BBU
X15 X15 X15 X15 X15 X15 X15 X15
Expander

VNODE 1 VNODE 2 VNODE 3 VNODE 4 VNODE 5 VNODE 6 VNODE 7 VNODE 8

52 Huawei Confidential
Pacific VNode Design: High-Ratio EC, Low Cost, Dual-Controller
Switchover, and High Reliability
• Vnode design supports large-scale EC and provides over 20% cost competitiveness.
• If a controller is faulty, the peer controller of the vnode takes over services in seconds, ensuring service continuity and improving reliability and availability.
• EC redundancy is constructed based on vnodes. Areas affected by faults can be reduced to fewer than 15 disks. A faulty expansion board can be replaced online.

Secondary controller service takeover in the case of a faulty controller Vnode-level high-ratio EC N+2
Enclosure 1 Enclosure 2

Controller A Controller B Controller A Controller B PCIe PCIe 4* PCIe PCIe 4*


card card 25G card card 25G

X
Compute virtual Compute virtual Compute virtual Compute virtual
unit unit unit unit
1620 1620

VNOD VNOD VNOD VNOD VNOD VNOD VNOD VNOD VNOD VNOD VNOD VNOD VNOD VNOD VNOD VNOD
E E E E E E E E E E E E E E E E
SAS/PCIE
Single-controller SSD Single-controller SSD
VNODE 1 VNODE 1 VNODE 1 VNODE 1 VNODE 1 VNODE 1 VNODE 1 VNODE 1
SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD

HALFPALM
HALFPALM

HALFPALM

HALFPALM

HALFPALM
HALFPALM

HALFPALM

HALFPALM
15 x HDDs 15 x HDDs
X15 X15 X15 X15 X15 X15 X15 X15

Dual-controller takeover

1. When a single controller is faulty, its SSDs are physically taken over by the peer controller within 10 seconds and no reconstruction is required. Compared with common hardware, the fault recovery duration is reduced from days to seconds.
2. The system software can be upgraded online and offline without service interruption.
3. Controllers and modules can be replaced online.
Vnode-level high-ratio EC:
1. The SSDs of a single controller are distributed to four vnodes based on the expander unit, so a large-ratio EC at the vnode level is supported.
2. With at least two enclosures and 16 vnodes configured, the EC ratio can reach 30+2, matching the reliability of general-purpose hardware.
3. Expansion modules (including power supplies) and disk modules can be replaced online independently.
53 Huawei Confidential
Atlantic Architecture: Flash Storage Native Design with
Ultimate Performance, Built-in Switch Networking, Smooth
Upgrade, and Ever-New Data
• Flash storage native design: Fully utilize the performance and density advantages of SSDs. The performance can reach 144 Gbit/s and up to 96 SSDs are supported, ranking
No.1 in the industry.
• Simplified network design: The built-in 100 Gbit/s IP fabric supports switch-free Atlantic networking and hierarchical Pacific networking, simplifying network planning.
• The architecture supports field replaceable unit (FRU) design for entire system components and independent smooth evolution for each subsystem, achieving one entire system
for ever-new data.

Key Specification | Requirement/Specification
System model | 5 U 8-node architecture, all-flash storage
Medium disk slot | An enclosure supports 96 half-palm SSDs.
Interface module | An enclosure supports 16 interface modules, compatible with half-height half-length standard PCIe cards.
Switch | An enclosure supports two switch boards shared by the eight controllers in the enclosure; up to three enclosures can be connected without external switches.
Node | A node supports one Kunpeng 920 processor, eight DIMMs, and 16 GB of backup power.

[Diagram: front and rear views of the enclosure - eight nodes (each a CPU with I/O modules) interconnected through the built-in IP fabric.]

54 Huawei Confidential
Native Storage Design of Atlantic Architecture
Native all-flash design:
• Matching between compute power and SSDs: Each Kunpeng processor connects to a small number of SSDs to fully utilize SSD
performance and avoid CPU bottlenecks.
• Native flash storage design: flash-oriented half-palm SSDs with a slot density 50% higher than that of a 2.5-inch disk and more efficient heat dissipation.

Traditional architecture design:
1. The ratio of CPUs to flash storage is unbalanced, causing insufficient utilization of the flash storage.
2. A 2 U device can house a maximum of 25 SSDs, making it difficult to improve the medium density.
3. The holes in the vertical backplane are small, requiring more space for CPU heat dissipation.
4. 2.5-inch NVMe SSDs evolved from HDDs and are not a form factor dedicated to SSDs.
Atlantic architecture, designed for high-performance and high-density application scenarios (half-palm SSDs):
1. The parallel backplane design improves the backplane porosity by 50% and reduces power consumption by 15%; compute density and performance are improved by 50%.
2. The half-palm SSD design reduces thickness, increases depth, and improves SSD density by over 50%.
3. Optimal ratio of CPUs to SSDs: a single enclosure supports eight nodes, improving storage density and compute power by 50%.

55 Huawei Confidential
Atlantic Network Design: Simplified Networking Without Performance Bottleneck
Switch-free design:
• The 24-node switch-free design eliminates network bandwidth bottlenecks.
• With the switch-free design, switch boards are interconnected through 8 x 100GE ports.
• The passthrough mode supports switch interconnection. The switch board ports support 1:1, 1:2, or 1:4 bandwidth convergence ratios and large-scale flat networking.
[Diagram: left - 24-node switch-free networking: three Atlantic enclosures (each with four nodes at 100GE per node) interconnected through their built-in SD6603 switch boards over 100GE/QSFP-DD links, with Pacific enclosures attached over 4 x 25GE to implement tiered storage and BMC ports for management; right - large-scale cluster switch networking: Atlantic enclosures uplink over 4 x 100GE to 64-port 100GE switches.]

56 Huawei Confidential
OceanStor 100D Block+File+Object+HDFS
Overview | Ultimate Performance | Ultimate Reliability | Ultimate Efficiency | Ultimate Usability | Hardware | Scenario and Ecosystem

57 Huawei Confidential
Block Service Solution: Integration with the VMware Ecosystem

[Diagram: VMware vCloud Director, vRealize Operations, vCenter Server, vRealize Orchestrator, and Site Recovery Manager integrate with OceanStor 100D through the vCenter Plugin (storage integration management) and VASA (VVol feature) over REST.]
• vCenter Plugin and VASA (VVol) integration: provides vCenter-based integration management, enhances VVol-based QoS and storage profile capabilities, and supports VASA resource management across multiple data centers.
• vROps (monitoring): integrates with vROps to provide O&M capabilities based on storage service performance, capacity, alarms, and topologies, implementing unified O&M management.
• vRO/vRA (workflow): enhances vRO and vRA integration and supports replication, active-active, snapshot, and consistency group capabilities based on OceanStor 100D 8.0.
• SRM (DR): interconnects with SRM to support replication, active-active switchover, switchback, and test drills, providing end-to-end solutions.

58 Huawei Confidential
Block Service Solution: Integration with OpenStack
OceanStor 100D Cinder Driver architecture:
[Diagram: the OpenStack Cinder API, Cinder Scheduler, and Cinder Volume services call the OpenStack volume driver API; the Huawei-developed OpenStack plugin (delivered in the STaaS or eSDK package) accesses OceanStor 100D distributed storage over a REST API.]
Applicable versions:
• Community Edition: OpenStack Kilo, Mitaka, Newton, Queens, Rocky, Stein
• Red Hat: Red Hat OpenStack Platform 10.0
OpenStack P release Cinder Driver API (mandatory core functions), all supported by OceanStor 100D:
create_volume, delete_volume, create_snapshot, delete_snapshot, get_volume_stats, create_volume_from_snapshot, create_clone_volume, extend_volume, copy_image_to_volume, copy_volume_to_image, attach_volume, detach_volume, initialize_connection, terminate_connection
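For orientation, a standalone, hedged sketch of what a Cinder-style driver exposing a few of these core functions looks like; it does not import the real cinder base classes, and the management endpoint and REST paths are placeholders, not a documented OceanStor API.

    import requests

    DORADO_MGMT = "https://storage-mgmt.example.com:8088/api"   # hypothetical endpoint

    class ExampleVolumeDriver:
        def __init__(self, session=None):
            self.http = session or requests.Session()

        def create_volume(self, volume):
            # volume is assumed to be a dict like {"name": ..., "size": GB}
            resp = self.http.post(f"{DORADO_MGMT}/volumes",
                                  json={"name": volume["name"], "size_gb": volume["size"]})
            resp.raise_for_status()

        def delete_volume(self, volume):
            self.http.delete(f"{DORADO_MGMT}/volumes/{volume['name']}").raise_for_status()

        def initialize_connection(self, volume, connector):
            # Map the volume to the host described by connector (e.g. its iSCSI IQN).
            resp = self.http.post(f"{DORADO_MGMT}/volumes/{volume['name']}/mappings",
                                  json={"initiator": connector["initiator"]})
            resp.raise_for_status()
            return {"driver_volume_type": "iscsi", "data": resp.json()}

        def terminate_connection(self, volume, connector):
            self.http.delete(f"{DORADO_MGMT}/volumes/{volume['name']}/mappings"
                             f"/{connector['initiator']}").raise_for_status()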

59 Huawei Confidential
Block Service Solution: Integration with Kubernetes
Process for Kubernetes to use the OceanStor 100D CSI plug-in to provide volumes:
① The Kubernetes Master instructs the CSI plug-in to create a volume. The CSI plug-in invokes an OceanStor 100D interface to create the volume.
② The Kubernetes Master instructs the CSI plug-in to map the volume to the specified node. The CSI plug-in invokes the OceanStor 100D interface to map the volume to that node host.
③ The target Kubernetes node to which the volume is mapped instructs the CSI plug-in to mount the volume. The CSI plug-in formats the volume and mounts it to the specified directory for the container.
[Diagram: the CSI plug-in runs with the community-provided driver-register, external-provisioning, and external-attaching assistance services on the Kubernetes Master (management plane) and on every Kubernetes node (data plane); the node accesses the block device over SCSI/iSCSI and mounts it, for example at /mnt, for the container.]
• The CSI plug-in assistance services are provided by the Kubernetes community.
• The CSI plug-in is deployed on all Kubernetes nodes according to the CSI specification to complete volume creation/deletion, mapping/unmapping, and mounting/unmounting.

60 Huawei Confidential
NAS HPC Solution: Parallel High Performance, Storage and Compute
Collaborative Scheduling, High-Density Hardware, and Data Tiering
[Diagram: HPC applications use I/O libraries (serial POSIX and parallel MPI-IO) on a 50,000+ node compute farm with login and job scheduler nodes; the parallel file system client connects over the front-end network to OceanStor 100D File racks backed by an SSD pool and an HDD pool over the back-end network.]
Parallel high-performance client
• Compatible with POSIX and MPI-IO: meets the requirements of the high-performance compute application ecosystem and provides an MPI optimization library contributed to the parallel compute community.
• Parallel I/O for high performance: 10+ Gbit/s for a single client and 4+ Gbit/s for a single thread.
• Intelligent prefetch and local cache: supports cross-file prefetch and local cache, meeting the high-performance requirements of NDP.
• Large-scale networking: meets the requirements of more than 10,000 compute nodes.
Collaborative scheduling of storage and compute resources
• Huawei scheduler: service scheduling and data storage collaboration, data pre-loading to SSDs, and local cache.
• QoS load awareness: I/O monitoring and load distribution schedule work according to compute capabilities and improve overall compute efficiency.
High-density customized hardware
• All-flash Atlantic architecture: 5 U, 8 controllers, 80 disks, 70 Gbit/s bandwidth and 1.6 million IOPS per enclosure.
• High-density Pacific architecture: 5 U, 120 disks, supporting online maintenance and providing a high EC ratio to ensure high disk utilization.
• No separate back-end network: with the built-in switch module, the back-end network does not need to be deployed independently when there are fewer than three enclosures.
Data tiering for cost reduction
• Single view with multiple media pools: the service view does not change and applications are not affected by the change.
• No traditional burst buffer: the all-flash optimized file system simplifies deployment of a single namespace and provides higher performance, especially for metadata access.
61 Huawei Confidential
HDFS Service Solution: Native Semantics, Separation of Storage and
Compute, Centralized Data Deployment, and On-Demand Expansion

[Diagram: left - a traditional Hadoop cluster (FusionInsight, Hortonworks, Cloudera; HBase/Hive/Spark) with a management node and combined compute/storage nodes, each running the HDFS component on its local storage; right - a Hadoop compute cluster whose HDFS applications access a distributed storage cluster through native HDFS semantics.]
• Based on general-purpose x86 servers and Hadoop software, a traditional deployment accesses the local HDFS on combined compute/storage nodes, so compute and storage resources can only be expanded together.
• OceanStor 100D HDFS replaces the local HDFS of Hadoop. Using native semantics for interconnection, storage and compute resources are decoupled and capacity can be expanded independently, facilitating on-demand expansion of compute and storage resources.

62 Huawei Confidential
Object Service Solution: Online Aggregation of Massive
Small Objects Improves Performance and Capacity
When a stored object is smaller than the system strip, a large number of space fragments are generated, which greatly affects space usage and access efficiency.
[Diagram: small objects Obj1-Obj6 are aggregated in the SSD cache of each node into 512 KB strips (Strip1-Strip4 plus Parity 1 and Parity 2, EC 4+2 as an example); incremental EC is performed after the small objects have been aggregated into large objects.]
Object data aggregation
• Incremental EC aggregates small objects into large objects without performance loss.
• Reduces storage space fragmentation in massive small-file scenarios such as government big data storage, carrier log retention, and bill/medical image archiving.
• Improves the space utilization of small objects from 33% (three copies) to over 80% (EC 12+3).
• SSD cache is used for object aggregation, improving the performance of a single node by six times (PUT 3,000 TPS per node).

63 Huawei Confidential
Unstructured Data Convergence Implements
Multi-Service Interworking
Scenarios
• NAS: gene sequencing, oil survey, EDA, satellite remote sensing, IoV.
• HDFS: log retention, operation analysis, offline analysis, real-time search.
• S3 (Object): backup and archival, resource pool, PACS, check image.
Features
• FILE (NAS): private client, quota, snapshot, tiered storage, asynchronous replication.
• HDFS: private client, quota, snapshot, tiered storage, rich ecosystem components.
• OBJECT: metadata retrieval, quota, 3AZ, tiered storage, asynchronous replication.
O&M: converged resource pool management, out-of-the-box delivery, intelligent O&M.
Base capabilities: variable-length deduplication and compression, converged resource pool, multi-protocol interoperation, storage and compute collaboration, unstructured data base, ecosystem compatibility.
Customer value
• Converged storage pool: the three types of storage services share the same storage pool, reducing initial deployment costs.
• Multi-protocol interoperation: multiple access interfaces for one copy of data eliminate the need for data flows between systems and reduce TCO.
• Shared dedicated hardware: one set of dedicated hardware can flexibly deploy the three types of storage services and is compatible with hardware components of different generations, implementing ever-new data storage.
• Co-O&M management: one management interface integrates multiple storage services, achieving unified cluster and O&M management.
Dedicated hardware
• Low cost: ultra-high capacity density, with HDD nodes providing 24 disks per U.
• High performance: high-density performance, with all-flash nodes providing 20 Gbit/s per U.
• High reliability: dedicated hardware, fast sub-health isolation, and fast failover.
• Competitive hardware differentiation: integrated software and hardware, non-volatile memory, Huawei-developed SSDs and NICs, and customized distributed storage hardware increase differentiated competitiveness.

64 Huawei Confidential
Thank you.
Bring digital to every person, home, and organization for a fully connected, intelligent world.

Copyright©2020 Huawei Technologies Co., Ltd.


All Rights Reserved.

The information in this document may contain predictive


statements including, without limitation, statements regarding
the future financial and operating results, future product
portfolio, new technology, etc. There are a number of factors that
could cause actual results and developments to differ materially
from those expressed or implied in the predictive statements.
Therefore, such information is provided for reference purpose
only and constitutes neither an offer nor an acceptance. Huawei
may change the information at any time without notice.
