Storage CTO
The Cutting Edge of Storage Innovation
Primary Storage Leader in Gartner Magic Quadrant
2018 MQ for General-Purpose Arrays · 2018 MQ for Solid-State Arrays · 2019 MQ for Primary Storage
"Huawei has progressively become one of the leading providers of primary storage on the global stage."
"Its external enterprise storage portfolio for primary storage workloads - OceanStor - spans all market segments."
"Huawei announced new versions of OceanStor Dorado6000 V3 and Dorado18000 V3 that support internal NVMe SSDs."
"Huawei's SmartVirtualization plus SmartMigration software enables users to nondisruptively migrate data from competitive external enterprise storage systems to OceanStor, or to migrate from an older OceanStor platform to a new OceanStor platform."
— Gartner
The Cutting Edge of Storage Innovation
OceanStor Dorado Product Portfolio
FlashLink® - The Foundation of Evolution
[Diagram: the FlashLink chip family and the roles each chip plays]
• Array Controller Chip: Kunpeng 920 - SPECint 930+, the top-performing ARM processor
• Multi-Protocol Network Chip: Hi1822 - supports both FC and Ethernet
• AI Chip: Ascend 310 - AI SoC for small-scale training; >1 TeraFLOPS for real-time analytics (data correlation, data similarity, adaptive optimization, health analytics, data temperature, failure prediction)
• BMC Chip: Hi1710 - 93% troubleshooting accuracy
• SSD Controller Chip: Hi1812e - half the latency of the previous model; also powers the processor-embedded intelligent disk enclosure
Use cases: intelligent cache, smart QoS, intelligent data deduplication, and more.
Kunpeng® CPU - The Heart of New Storage
[Diagram: 48-core Kunpeng CPU, with a dedicated NVMe submission/completion queue pair per core]
SmartMatrix - Symmetric A/A Controller Architecture
[Diagram: two engines, each with shared front-end adapters serving all controllers, interconnected over an RDMA network]
• Symmetric active/active controllers with a fully meshed topology
• Shared-everything architecture, from front end and back end to drive enclosure
• Persistent cache mirroring with up to 3 copies
• Non-disruptive firmware upgrade; I/O hang time limited to within 1 second
• End-to-end NVMe support
• Back-end RDMA network over 100 Gb/s Ethernet
• SCM support for read acceleration*
OceanStor Dorado - New Gen of Mission Critical Storage
Every Second is Valuable
Time is Money
[Chart: average cost of one hour of downtime by industry, ranging from roughly $913,242 to $5,000,000]
One-Second Controller Failover
[Diagram: controller failover comparison. Solutions A and B take 4 to 9+ seconds to restore IOPS after a controller failure; OceanStor Dorado restores IOPS within 1 second.]
* The above figures refer to test results from Huawei labs.
One Second's Magic - Shared Front-End Adapter
[Diagram: the shared front-end adapter keeps server links up while any controller in the engine serves the I/O]
Multiple Controller Fault Tolerance
[Diagram: shared front-end adapters (FE) remain connected while controllers fail one after another; the surviving controllers keep serving through the shared back end (BE)]
Best-of-Breed Reliability & Availability
Solution A:
• The front-end adapter cannot be shared between controllers in one engine.
• A LUN has to be owned by a single controller.
• An SSD enclosure can be shared by all controllers in one engine.
Solution B:
• The front-end adapter can be shared by all controllers in one engine.
• An SSD enclosure can be shared by all controllers in one engine.
OceanStor Dorado:
• The front-end adapter can be shared by all controllers in one engine.
• LUN ownership is eliminated.
• An SSD enclosure can be shared by all controllers across multiple engines.
Firmware Non-Disruptive Upgrade (NDU)
[Diagram: modular firmware stack: protocol, data, control, management, and inter-communication services]
• Modular design: 94% of firmware components support online upgrade
• One second to active
• No connection loss with servers; transparent to applications
Intelligent DAE - The SSD Shelf with Processing Power
[Diagram: an engine with four controllers (FE/Ctrl/BE) attached to an intelligent DAE with two DAE controllers handling data compression, erasure coding, and data rebuilding]
Storage controller offloading:
• Each DAE has two controllers, and each DAE controller has its own processor, cache, and adapter.
• The DAE controller takes over some workloads from the array controller, including data rebuilding, erasure coding (EC)*, and data compression*.
• This distributed computing design halves data rebuilding time and cuts the rebuilding impact on array controller performance (max IOPS) from 15% to 5%. Rebuilding bandwidth rises from 80 MB/s to 200 MB/s while array controller CPU utilization stays at 70%.
Comprehensive HA/DR Solutions
[Diagram: HA/DR topologies: synchronous HyperMetro mirroring between production arrays over IP networks with a quorum server, extended to 3DC layouts - parallel and star - that add asynchronous replication over WAN to a standby Site C; validated with VMware vSphere and FusionSphere clusters]
More Robust Storage HA Cluster
[Diagram: nine failure scenarios (controller, site, link, and quorum faults) that the storage HA cluster with witness/quorum survives, compared against Solution A]
Extreme Performance Experience
Extreme Performance Experience
• DB acceleration: 57,000 vs. 11,500 transactions per second (40 x 3.84 TB SSDs, SwingBench OE2 transaction generator)
• VM delivery: 100 VM clones of 50 GB each delivered in 1.5 minutes vs. 52 minutes
• VDI support: 7,200 vs. 1,500 VDI desktops (100 x 3.84 TB SSDs with data reduction)
Test case: mixed workload, 8 KB, 7:3 read/write, 1 ms average latency, 8 LUNs, 32 outstanding I/Os.
End-to-End Load Balancing
[Diagram: I/Os from any front-end port land in a global cache spanning all storage controllers and back-end adapters]
• Write I/O requests for a single LUN can be placed into cache space on multiple storage controllers.
• For a better cache read hit rate, a storage controller can place prefetched data in the global cache for potential read requests from any front-end link.
CPU Resource Dynamic Scheduling
[Diagram: request slices #0-#9 dispatched to core groups on each storage controller]
• CPU core grouping: CPU cores are divided into multiple groups, and each group is assigned a specific job (e.g., I/O read & write, data flushing, data reduction).
• Dynamic scheduling: higher-priority jobs can acquire more cores from shared core groups.
• Workload isolation: each CPU core has its own I/O requests to process, avoiding lock contention between cores. A minimal sketch of this grouping-and-borrowing idea follows.
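The sketch below is an illustrative model in Python, not Huawei's scheduler: the job names (io_rw, flush, reduction) and the shared-pool borrowing policy are assumptions chosen to mirror the three bullets above.

```python
from dataclasses import dataclass, field

@dataclass
class CoreGroup:
    job: str                                  # job type pinned to this group
    cores: set = field(default_factory=set)

class Scheduler:
    def __init__(self, dedicated, shared_cores):
        self.groups = {g.job: g for g in dedicated}
        self.shared = set(shared_cores)       # cores any job may borrow

    def borrow(self, job, n):
        """Move up to n shared cores into a high-priority job's group."""
        grant = {self.shared.pop() for _ in range(min(n, len(self.shared)))}
        self.groups[job].cores |= grant
        return grant

    def release(self, job, cores):
        """Return borrowed cores to the shared pool."""
        self.groups[job].cores -= set(cores)
        self.shared |= set(cores)

sched = Scheduler(
    dedicated=[CoreGroup("io_rw", {0, 1}), CoreGroup("flush", {2}),
               CoreGroup("reduction", {3})],
    shared_cores={4, 5, 6, 7},
)
print(sched.borrow("io_rw", 2))   # I/O pressure rises: grab 2 shared cores
```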
Powered by NVMe and RoCE
[Diagram: I/O path from the server through RoCE/FC front-end adapters, storage controllers, and back-end adapters to the DAE; front-end latency 50 us reduced to 30 us, back-end 100 us reduced to 30 us]
The latest protocol and network standards:
• ~50% latency reduction from the latest protocols (NVMe and RoCE v2)
Front-end/back-end adapter protocol offloading:
• ~10% latency reduction from the self-developed TOE front-end adapter chip and ASIC I/O balancing/distribution
Intelligent DAE and self-developed SSDs:
• Read-priority technology: read requests on SSDs are executed preferentially to respond to hosts in a timely manner, cutting latency in hybrid (mixed read/write) scenarios by 20%.
• 30% performance improvement from SAS DAE connection multiplexing technology
Load Balancing - The Core Demand of Mission-Critical Applications
[Diagram: 24 active database logs mapped through the DKC's front-end directors (FED), virtual storage directors (VSD), and back-end directors (BED) to the DKU]
Bank X case study
Database log switching:
• A financial-services customer ran DB2 for its core banking application with 24 active logs in circular mode, one log per hour.
• In any given hour, only one active log was busy.
LUN ownership:
• The customer enabled "DB2 full logging" for potential problem analysis, so the workload became much heavier than before.
• The storage hit performance issues accordingly, because the mapping between active logs and storage controllers was fixed (known as LUN ownership).
• The I/O workload on a given storage controller could not be shared by other controllers; the unbalanced workload led to a performance bottleneck.
Processor-Level Load Balancing
[Diagram: Solution B's two engines with per-controller processors and engine-owned SSD enclosures, vs. OceanStor Dorado's SSD enclosures shared across engines]
Solution B:
• The owner controller has to take most of the workload.
• The 2nd engine cannot participate in load balancing.
• SSD enclosures owned by the 2nd engine cannot be shared with the 1st engine, so write I/O flushing is constrained within one engine.
OceanStor Dorado:
• Workload can be spread across all controllers of the 1st and 2nd engines at the processor level.
• SSD enclosures are shared between engines via the RDMA network, and data can be flushed from both engines.
Business Always-On
Business Always-On with Lower TCO
[Chart: cost over successive upgrades, traditional solution vs. Huawei solution]
FlashEver:
• Replace the controller module only - no downtime, no data migration, no re-cabling.
• Intermix DAEs of various generations - no downtime, less data migration.
Non-Disruptive Tech. Refresh
FlashEver:
• Supports non-disruptive controller upgrades, even across the next several generations over 10 years.
• Tech-refresh existing assets to gain the advantages of the latest technology.
• Supports non-disruptive data mobility and online node reorganization.
Storage federation:
• Up to 128 controllers.
• Supports OceanStor Dorado and the following generations; different generations can be mixed in one federation cluster.
Heterogeneous virtualization:
• Virtualizes third-party storage by taking over its access paths.
• Reuses old storage to protect customer investment.
• Smoothly cuts services over to OceanStor Dorado and the following generations.
FlashEver & Storage Federation Use Case
Incomparable Flexibility for The Next Decade
[Chart: cost over time. Traditional solution: initial purchase, upgrades, then a full tech refresh. Huawei solution: controller upgrades only.]
• Solution A: upgrade available only from VSP G1000 to VSP G1500 - a one-generation design.
• Huawei: replace only the old storage controller module, protecting the investment as much as possible.
Wrap Up
Strong capability in intelligent chip development:
• Array controller
• BMC
• Multi-protocol chip (FE/BE adapter)
• SSD controller
• AI chip
Symmetric A/A storage controller architecture:
• Shared front-end and back-end adapters
• Fully meshed topology
• No LUN ownership
• Cross-engine load balancing
Thank you.
Bring digital to every person, home, and
organization for a fully connected,
intelligent world.
[Diagram: OceanStor Dorado V6 family: Dorado3000 V6, Dorado5000 V6, Dorado6000 V6, Dorado8000 V6]
OceanStor Dorado V6: The Cutting Edge of Storage Innovation
Agenda: Hardware Design | Extremely Reliable | Extreme Performance | High Efficiency
New Generation Innovative Hardware Platform
• High-end controller enclosure
• Mid-range controller enclosure: 2U, 2 controllers per controller enclosure
• Entry-level controller enclosure: 2U, 2 controllers per controller enclosure
• SSD enclosure: 2U, 36 NVMe SSDs (high density)
[Diagram: controller enclosure internals: I/O cards, CPUs, NVMe SSDs, 100GE interconnect, and 12 fan modules]
Same capacity, width reduced by 36%
[Diagram: Kunpeng 920 SoC: 48 cores, 8-channel DDR4, PCIe 4.0, integrated 100G RoCE and SAS 3.0, and acceleration engines - high integration, not only computing]
End-to-end NVMe:
• Front end: NVMe over FC (32G) / NVMe over Fabrics (RoCE)
• Back end: NVMe SSDs / intelligent DAE
[Diagram: two controller enclosures, each with shared back-end modules]
Agenda: Hardware Design | Extremely Reliable | Extreme Performance | High Efficiency
[Diagram: SmartMatrix: full-mesh interconnection between all controllers (A-D, 48 cores each) within each engine; shared back-end interconnection modules connect the engines]
Intelligent DAE
[Diagram: cache copies A/A'/A'' through H/H'/H'' distributed across 8 controllers in 2 engines behind shared front-end and shared back-end modules]
• The global cache keeps 3 copies across two engines, maintained by continuous mirroring.
• At least 1 cache copy survives if 2 controllers fail simultaneously; even a single engine tolerates 2 simultaneous controller failures with the 3-copy global cache.
• One disk enclosure can be accessed by all 8 controllers (2 engines) through the shared back-end module.
• At least 1 cache copy survives if an entire engine fails.
• Tolerates the failure of 7 of 8 controllers (2 engines), one by one.
The best active-active design: (1) 3-copy cache, (2) RoCE network, (3) intelligent DAE
[Diagram: front-end, controller, and back-end layers compared across three designs]
• Disk enclosure shared by dual controllers: failure of both controllers (one engine) interrupts service.
• Disk enclosure shared by four controllers: failure of 4 controllers (one engine) interrupts service.
• OceanStor Dorado: global cache with continuous mirroring and 3 copies across 2 engines; disk enclosures shared by 8 controllers. No service interruption when any 2 controllers fail at the same time, when 1 engine fails, or when 7 of 8 controllers (2 engines) fail one by one.
SmartMatrix: 99.9999% availability for the most demanding enterprise reliability needs
RAID 2.0+
[Diagram: data and hot-spare space distributed as chunks across all disks rather than dedicated spare drives]
Dynamic reconstruction: when a disk fails, new stripes shrink from N+M to (N-1)+M.
[Diagram: stripe T0 (D0 D1 D2 D3 P Q) rewritten as a narrower stripe T1 after a disk failure]
[Diagram: reconstruction offloaded from the array controllers to the intelligent DAE, which assembles the required chunks from its SSDs locally and returns rebuilt pieces of data]
With reconstruction offloaded to the intelligent DAE, the reconstruction bandwidth consumed for a single disk in a 23+2 RAID group is reduced from 24x to 5x.
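A hedged sketch of why in-shelf rebuild reduces what the controller must read: each enclosure XOR-combines the surviving chunks it holds locally and ships one partial result, so the controller reads a few aggregates instead of every chunk. Parity math is simplified to XOR (RAID-5 style), and the chunk counts and helper names are illustrative assumptions, not the product's EC scheme.

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def rebuild_via_enclosures(enclosures):
    """enclosures: list of lists of surviving chunk payloads per enclosure."""
    partials = [reduce(xor, chunks) for chunks in enclosures]  # computed in-shelf
    return reduce(xor, partials)                               # controller side

# 24 surviving chunks spread over 4 enclosures -> controller reads 4 partials.
chunks = [bytes([i]) * 4 for i in range(24)]
lost = reduce(xor, chunks)                     # the chunk to reconstruct
survivors = [chunks[i:i + 6] for i in range(0, 24, 6)]
assert rebuild_via_enclosures(survivors) == lost
```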
[Diagram: HA/DR topologies: synchronous HyperMetro mirroring between production arrays over IP networks with a quorum server, extended to 3DC layouts - parallel and ring - that add asynchronous replication over WAN to a standby Site C; validated with VMware vSphere and FusionSphere clusters]
[Diagram: snapshots and clones of a source LUN: HyperClone_1 created from read-only snapshot HyperCDP_0 at TP0, HyperClone_2 at TP1, and HyperClone_4 as a cascading clone]
Remarks: TP0, TP1, and TP2 are examples; a snapshot or clone can be associated with any point in time of the source LUN.
HyperCDP - data protection | HyperSnap - read & write | HyperClone - data mirror
COW (copy-on-write):
1. Copy data from the original location to a new location.
2. Write the new data at the original location.
3. Modify the LUN mapping table and other metadata; the copied old block backs the snapshot.
Cost: 1 read, 2 writes, and 1 metadata update.
ROW (redirect-on-write):
1. Write the new data into a new physical location.
2. Modify the LUN mapping table and other metadata.
3. Add the old data block to the list of blocks to be released.
Cost: 1 write and 1 metadata update.
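The two write paths can be contrasted in a few lines. This is a minimal sketch that assumes a volume is just a logical-to-physical mapping table; the class and field names are hypothetical, not the product's data structures.

```python
class CowVolume:
    def __init__(self, mapping, storage):
        self.map, self.store, self.snap = mapping, storage, dict(mapping)

    def write(self, lba, data, free_loc):
        old = self.map[lba]
        self.store[free_loc] = self.store[old]  # 1 read + 1 write: copy old data
        self.snap[lba] = free_loc               # metadata update for the snapshot
        self.store[old] = data                  # 2nd write: new data in place

class RowVolume:
    def __init__(self, mapping, storage):
        self.map, self.store, self.snap = mapping, storage, dict(mapping)
        self.to_release = []

    def write(self, lba, data, free_loc):
        self.store[free_loc] = data             # 1 write: new data to new location
        old, self.map[lba] = self.map[lba], free_loc  # 1 metadata update
        if old not in self.snap.values():       # snapshot no longer needs it?
            self.to_release.append(old)         # queue old block for release

store = {0: b"A", 9: None}
cow = CowVolume({"lba0": 0}, dict(store))
cow.write("lba0", b"A1", free_loc=9)
print(cow.map, cow.snap)   # active stays at loc 0; snapshot points at copy in 9

row = RowVolume({"lba0": 0}, dict(store))
row.write("lba0", b"A1", free_loc=9)
print(row.map, row.snap)   # active redirected to loc 9; snapshot keeps loc 0
```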
[Diagram: user-space services (protocol, data, control, management, inter-communication) restart within 1 second on a stable OS kernel]
[Diagram: management and I/O processing processes upgraded in user space while the system, device, configuration, and UI/CLI management processes keep running]
1. User-mode upgrade - no system reboot.
2. Self-designed chips hold I/Os and maintain host link information during the upgrade.
3. The I/O processing process is upgraded.
4. Host I/O recovers.
Ever Solid - reliability at every layer:
• Solution level (99.99999%): FlashEver without data migration; HyperCDP, HyperSnap, HyperClone, HyperReplication, 3DC.
• System level (99.9999%): RAID-TP (tolerates simultaneous failure of 3 disks); RAID 2.0+ and dynamic reconstruction; end-to-end DIF; intelligent DAE; non-disruptive upgrade.
• Architecture: SmartMatrix fully interconnected architecture tolerates failure of 7 out of 8 controllers.
• Component: global wear leveling and Huawei-patented global anti-wear leveling; intelligent I/O module; built-in dynamic RAID that utilizes spare space.
Agenda: Hardware Design | Extremely Reliable | Extreme Performance | High Efficiency
Global pool:
• A storage pool can spread across all controllers and use all the SSDs connected to them to store every LUN's data and metadata via RAID 2.0+.
[Diagram: the global pool (RAID 2.0+, FlashLink 2.0) in three deployments: mid-range/entry-level with native multipath, mid-range/entry-level with UltraPath, and high-end with native multipath]
[Diagram: storage controllers reach intelligent DAEs through shared back-end modules over 100G RoCE; front-end latency 50 us vs. 30 us, back-end 100 us vs. 30 us]
Self-developed ASIC, SSDs, and enclosures:
• Read-priority technology: read requests on SSDs are executed preferentially to respond to hosts in a timely manner.
• The intelligent disk enclosure has its own CPU, memory, and hardware acceleration engine; data reconstruction is offloaded to it to reduce latency.
• Multi-queue and polling, lock-free.
[Diagram: I/O stack comparison. Before: value-added features, OMM, and space management sit in user space while the SAS driver, disk management, and pool management sit in kernel space, with frequent calls between the two modes - high latency. After: the NVMe driver, disk management, and pool management move into user space, reducing interactions between the two modes - low latency.]
[Diagram: protocol-stack comparison of a standard NIC, a TOE NIC, and a DTOE NIC: with DTOE, PHY/MAC/IP/TCP and DIF processing move out of the kernel socket stack into the NIC, so the application exchanges data with hardware directly]
Offloading workloads from controllers to intelligent DAEs (connected over RoCE, built with proprietary SSD and DAE chips):
• Improves system performance by 30%.
• Improves reconstruction speed by 100%.
• Lowers the performance impact of reconstruction on services from 15% to 5%.
[Chart: write amplification and life-cycle of a standard array vs. a multi-stream array]
Multi-stream: hot, cold, and deleted data are written to separate physical blocks, so garbage collection moves no data during full-block reclaim. Write amplification drops by over 60%, and SSD life expands 2x.
[Diagram: incoming writes (A, B, C, D, E, ...) aggregated in cache (~50 us) and written to SSDs as full stripes]
[Diagram: data updates written as full stripes (A B C D E + P Q) into a Dorado CKG, avoiding the read-modify-write penalty of RAID 5 / RAID 6 / RAID-TP parity updates]
[Diagram: data pairs that byte-level variable-length deduplication can still match: partially identical with a sector offset, partially identical with a byte offset, and partially identical with differences anywhere]
Similar Fingerprint (SFP)
Variable-length deduplication relies on similar fingerprints: for a set of data blocks that are not identical but similar, an SFP can be calculated, and multiple SFPs are computed for the same data section.
• Step 1: Calculate the SFPs of the incoming data and look up existing data with the same SFP.
• Step 2: Group similar data together, select a reference block (e.g., data #1), and mark the difference parts on the other blocks (#2-#4).
• Step 3: Store the reference block plus the difference parts for the other data.
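A toy version of that pipeline, under loud assumptions: the sfp function here is a min-hash over shingles standing in for the real (undisclosed) SFP computation, and delta records per-byte differences only; names and block sizes are illustrative.

```python
def sfp(data: bytes, shingle: int = 8) -> int:
    """Similarity fingerprint: min hash over sliding windows (an assumption)."""
    return min(hash(data[i:i + shingle]) for i in range(len(data) - shingle + 1))

def delta(ref: bytes, blk: bytes):
    """Record (offset, byte) pairs where blk differs from the reference."""
    return [(i, b) for i, (a, b) in enumerate(zip(ref, blk)) if a != b]

blocks = [b"ABCDEFGHijklmnop", b"ABCDEFGHijklmnoQ", b"ABCDEFGHiXklmnop"]
groups = {}
for blk in blocks:                         # step 1: fingerprint and group
    groups.setdefault(sfp(blk), []).append(blk)

for fp, similar in groups.items():
    ref = similar[0]                       # step 2: pick a reference block
    deltas = [delta(ref, b) for b in similar[1:]]
    print(fp, len(similar), deltas)        # step 3: store reference + deltas
```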
[Diagram: bit-level similarity scores (0.7-0.9) between candidate blocks used to select reference blocks]
Result: a 25% higher data reduction rate for image data.
Capacity consolidation:
• Use the latest media and larger-capacity drives (e.g., a 350 TB DAE replaced by a 576 TB new DAE) to replace old drives, saving physical space and making capacity more flexible.
SmartMigration + SmartVirtualization:
• Take over the paths of third-party storage to keep leveraging the old system, then smoothly cut services over to Dorado V6 and the generations that follow.
[Diagram: legacy and new DAEs intermixed behind one Dorado V6 system]
Storage federation:
• Step 1: Build a cluster network across the Dorado systems; the cluster scales to a maximum of 128 controllers.
• Step 2: Join and unjoin individual Dorado systems to and from the cluster.
Take-away
Thank you.
What’s NVMe?
[Diagram: SAS was designed for disk - CPU cores funnel into a single path per SAS SSD/HDD; NVMe was designed for flash/SCM - every CPU core gets its own queue to the SSD]
[Diagram: SAS protocol stack vs. NVMe protocol stack between application, block layer, controller, and SSD]
• SAS write: four interactions between controller and SSD (1. transfer command, ..., 3. transfer data, 4. response feedback).
• NVMe: communication interactions are reduced from 4 to 2, lowering latency; NVMe's average storage latency is below that of SAS 3.0.
[Diagram: SAS - one queue per SSD shared by all cores and guarded by a lock; NVMe - per-core, lock-free queues on every SSD]
• NVMe: every CPU core has an exclusive, lock-free queue on each SSD. Queues per controller = number of disks x number of CPU cores processing back-end I/O. Example: Dorado 5000 NVMe with 36 SSDs and cores 0..N (N = 7) has 36 x 8 = 288 queues.
• SAS: each controller has one queue per SSD, shared by all CPU cores, with locks ensuring exclusive access across cores. Queues per controller = number of disks. Example: Dorado 5000 SAS with 25 SSDs has 25 queues.
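The queue arithmetic from this slide, restated as a tiny check (the function names are ours, not an API):

```python
def sas_queues(disks: int) -> int:
    return disks                      # one shared queue per SSD, guarded by a lock

def nvme_queues(disks: int, backend_cores: int) -> int:
    return disks * backend_cores      # per-core exclusive queues, no lock needed

print(sas_queues(25))                 # Dorado 5000 SAS, 25 SSDs -> 25
print(nvme_queues(36, 8))             # Dorado 5000 NVMe, 36 SSDs, cores 0..7 -> 288
```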
What's RoCE?
[Diagram: PCIe vs. NVMe-oF fabrics]
[Diagram: block addresses hashed onto a DHT ring of nodes (N1-N8) so each CPU/controller owns an even share of the data]
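Assuming the ring in the figure works like ordinary consistent hashing (the slide does not spell out the algorithm), a minimal sketch looks like this; the node names N1-N8 and the md5-based hash are placeholders.

```python
import bisect, hashlib

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.points = sorted((h(n), n) for n in nodes)   # node positions on ring

    def owner(self, key: str) -> str:
        """First node clockwise from the key's hash owns the block."""
        keys = [p for p, _ in self.points]
        i = bisect.bisect(keys, h(key)) % len(self.points)
        return self.points[i][1]

ring = Ring([f"N{i}" for i in range(1, 9)])
print(ring.owner("lun7:block42"))     # which node owns this block
```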
[Diagram: the data-service family - HyperSnap, HyperClone, HyperCDP, HyperReplication/A, HyperReplication/S, SmartMigration, and HyperMetro - built on shared time-point, data change log (DCL), and dual-write engines]
Huawei HyperMetro-based Active-Active
Data Centers Disaster Recovery Solution
Contents
3. Competitive Analysis
4. Key Technologies
Importance of Service Continuity for IT Systems
[Chart: typical hourly downtime losses by industry - media, healthcare, retail, manufacturing, telecom, energy, finance - in units of 10,000 USD; typical causes include power outages and virus or hacker attacks]
Source: Network Computing, the Meta Group, and Contingency Planning Research
International Standards for Disaster Recovery Construction
Recovery Point Objective (RPO): the amount of data lost due to downtime. Recovery Time Objective (RTO): the downtime itself.
| Tier | RPO | RTO | Solution |
| Tier 4: Point-in-time copies | Several hours | 4 to 12 hours | Backup solution |
| Tier 3: Electronic vaulting | 12 to 24 hours | < 24 hours | Backup solution |
| Tier 2: Data backup with hot site | 24 hours to days | 24 hours to days | Backup solution |
| Tier 1: Data backup with no hot site | Days | Days | Backup solution |
Source: SHARE's seven tiers of disaster recovery, released in 1992 and updated in 2012 by IBM into an eight-tier model.
Active-Active Data Centers Disaster Recovery Solution Ensures
24/7 Service Continuity
[Diagram: evolution from a local backup solution at one site, to active-passive data centers, to FusionSphere-based active-active data centers across Site 1 and Site 2]
Contents
3. Competitive Analysis
4. Key Technologies
Definition of Active-Active Storage
Definition
An active-active storage solution consists of two storage systems which provide two consistent data copies in real
time. The two data copies can be concurrently accessed by one host. The failure of any copy does not affect
services. The two storage systems can be deployed in two data centers to form an active-active data centers
solution with the active-active design of upper-layer applications (as well as the network layer).
Six key elements
1. Independent storage systems: An active-active relationship is established between two storage systems, and
both storage systems have independent hardware and software.
2. Active-active access: Two data copies are both in the active state (not active-passive mode) and can be
accessed by a host concurrently.
3. Convergence of SAN and NAS: SAN and NAS services can be deployed on the same device.
4. Dual arbitration modes: The active-active data centers disaster recovery solution uses an independent third-
place arbitration mechanism and supports static priority mode and quorum server mode. If the third-place
quorum server fails, the system automatically switches to the static priority mode.
5. Real-time data synchronization: Data is synchronized between the two data centers in real time and services are
automatically switched over in the event of a disaster, ensuring zero RPO and enabling the RTO to be
approximately equal to 0.
6. Smooth expansion: The active-active data centers can be further expanded to the geo-redundant layout with
three data centers.
End-to-End Physical Architecture
[Diagram: data centers A and B within 100 km, connected via raw optical fiber at the DC outlet]
• Network layer (core, aggregation, and access layers): active-active networking with highly reliable, optimized L2 interconnection and optimal access paths.
• Application layer: active-active applications - cross-DC high availability, load balancing, and migration scheduling for Oracle RAC, VMware, and FusionSphere.
• Storage layer: active-active access with zero data loss.
Convergence of SAN and NAS
[Diagram: a host application cluster accesses SAN and NAS services over FC or IP in both data centers; the production storage systems mirror writes in real time and share an IP-connected quorum server]
Working principle:
One storage system is deployed in data center A and the other in data center B. Both deliver read and write services in active-active mode, and write I/Os are mirrored between the two storage systems in real time to ensure data consistency. SAN and NAS services are configured on demand with high availability built into the storage layer; together with active-active hosts and networks at the application layer, services stay available across data centers. No data is lost if either storage system fails.
Highlights:
• Active-active, RPO = 0, RTO ≈ 0.
• Gateway-free configuration; file and database services deploy on the same device.
• A quorum server shared by SAN and NAS services ensures that either data center can keep providing services, with data consistency, in the event of a link failure.
• One type of networking (FC or IP) carries the heartbeat, configuration, and data replication traffic, meeting all SAN and NAS transmission requirements.
SAN
[Diagram: Oracle RAC, VMware vSphere, and FusionSphere clusters span data centers A and B over FC or IP, with data mirroring between the production storage systems and an IP-connected quorum server]
Working principle:
One storage system is deployed in data center A and the other in data center B. Both deliver read and write services in active-active mode. No data is lost if either storage system fails.
Highlights:
• Active-active architecture with active-active LUNs: both data centers are accessible to hosts, with real-time data synchronization; RPO = 0, RTO ≈ 0.
• Gateway-free configuration simplifies the network, cuts cost, and eliminates the latency caused by gateways.
• Dual arbitration modes - static priority and quorum server - enhance reliability.
• A single system can be upgraded to active-active mode, and further expanded to the geo-redundant mode with three data centers.
• Automatic repair of bad blocks across data centers.
• Storage protocol optimization halves the number of cross-site write I/O interactions, accelerating overall performance.
NAS
[Diagram: active-active file systems (FS 1, FS 2, ...) in vStore pairs across data centers A and B, with real-time data synchronization over FC or IP and an IP-connected quorum server]
Working principle:
One storage system is deployed in data center A and the other in data center B. Both deliver read and write services in active-active mode through pairs of active-active file systems: the primary storage system serves reads and writes while data is synchronized to the secondary system. An active-active switchover is executed at the granularity of tenant (vStore) pairs. No data is lost if either storage system fails.
Highlights:
• Active-active, RPO = 0, RTO ≈ 0.
• Gateway-free configuration simplifies the network, cuts cost, and eliminates the latency caused by gateways.
• Dual arbitration modes - static priority and quorum server - enhance reliability.
• A single system can be upgraded to active-active mode, and further expanded to the geo-redundant mode with three data centers.
• Storage protocol optimization halves the number of cross-site write I/O interactions, accelerating overall performance.
HyperMetro-based Active-Active Network
Host-to-storage network:
• SAN: supports both FC and IP networks; full-interconnection networking recommended; the same network type (IP or FC) must run from a server to both storage systems; dual-switch networking required.
• NAS: supports IP networks; full-interconnection networking recommended; dual-switch networking required.
High Availability Design
• Gateway-free: no extra gateway devices, reducing points of failure, simplifying the network, and delivering higher reliability (the EMC Unity series needs gateways).
• Cross-site bad block repair: if bad blocks cannot be repaired within one storage system, the data is read from the other storage system to repair them, without affecting service access.
• Dual arbitration modes: quorum server and static priority modes with automatic switchover between them; if the quorum server is faulty, static priority mode ensures service continuity.
High Performance Design
• The gateway-free design eliminates gateway bottlenecks, shortens the I/O path, and removes the 1 to 1.5 ms of latency a gateway adds.
• FastWrite combines the write command and data transfer, halving the latency of cross-site write I/O interactions.
• Optimistic locking resolves over 99% of potential concurrent-write conflicts of host I/Os with locks granted locally, reducing interactions between the storage systems.
Smooth Expansion to Three Data Centers
[Diagram: data centers A and B run HyperMetro over Fibre Channel; both hold asynchronous replication relationships over WAN to remote disaster recovery center C; BCManager DR management servers and optional DR servers connect through IP management networks]
Highlights:
• Disaster recovery for both SAN and NAS, as well as Dorado devices, ensuring database and file consistency.
• Entry-level, mid-range, and high-end storage systems interoperate, as do all-flash and non-flash devices, cutting investment in the disaster recovery centers.
• GUI-based disaster recovery management with one-click DR drills and recovery.
• HyperMetro for SAN is scalable to three data centers.
Deployment notes:
• Huawei's disaster recovery management software, BCManager, is deployed in data center B and remote center C for unified management of HyperMetro and asynchronous remote replication across the three data centers. It graphically shows the physical and logical service topologies and supports one-click testing and disaster recovery in remote center C.
• The BCManager eReplication management server must interconnect with the storage, Oracle server, VMware vCenter, and FusionSphere VRM management networks; management bandwidth must be >= 2 Mbit/s.
• Asynchronous remote replication runs over FC or IP networks; the RTT must be <= 100 ms and the bandwidth >= 10 Mbit/s (changed data volume in a service period / replication period).
Best Practice of Applying the Solution to Oracle RAC Applications
[Diagram: a 2 + 1 Oracle RAC cluster across two data centers: SERVICE1 is PREFERRED on INSTANCE1/INSTANCE2 and AVAILABLE on INSTANCE3; SERVICE2 is PREFERRED on INSTANCE3 and AVAILABLE on INSTANCE1/INSTANCE2]
Service distribution design:
• 2 + 1 cluster deployment suits data centers where service distribution is uneven, or two data centers with different priorities.
• Oracle RAC arbitration principles: the sub-cluster with the largest number of nodes wins; if the sub-clusters are the same size, the sub-cluster with the lowest node number wins.
• Access isolation to reduce cache convergence: create separate services at the Oracle RAC layer to keep data interactions within one data center. Use the PREFERRED/AVAILABLE settings of Oracle RAC transparent application failover (TAF) so applications access local instances only, switching to remote instances only when all local instances are faulty.
Best practice:
1. Store the binary files and home directories of Oracle Clusterware and the Oracle database locally to ease periodic upgrades.
2. Assign 60% to 80% of system memory to databases. For an OLTP database, give 80% of that memory to the system global area (SGA) and 20% to the program global area (PGA); for an OLAP database, split it 50/50 between SGA and PGA.
3. If hyper-threading is enabled, set parallel_threads_per_cpu to 1.
4. Set PARALLEL_MAX_SERVERS to Min(2 * parallel_threads_per_cpu * #cores, #disks).
5. Use the fast_start_mttr_target parameter to control recovery time (for example, 300 seconds).
6. To minimize the performance impact of "checkpoint not complete" events and frequent log switchovers, create three redo log groups for each thread and size each redo log so that a log switchover happens every 15 to 30 minutes.
Note: Oracle advises against deploying Oracle RAC in a virtualization environment.
Best Practice of Applying the Solution to VMware Applications
[Diagram: ESXi hosts in both data centers with management/heartbeat networks and storage paths to the HyperMetro storage systems]
APD (All Paths Down)
• Trigger condition: all links between an ESXi host and the storage systems fail, but ESXi heartbeats still work.
• Symptom: VMs on the ESXi host are suspended and cannot recover automatically.
• Huawei solution: detect the paths and use a timeout mechanism to prompt the ESXi host to perform automatic high-availability recovery of the VMs - the only solution that resolves this problem.
PDL (Permanent Device Loss)
• Trigger condition: the links between the storage systems fail and arbitration takes place, while ESXi heartbeats still work.
• Symptom: VMs on the ESXi host are suspended and cannot recover automatically.
• Huawei solution: enable the ESXi host to identify the PDL status and perform automatic high-availability recovery of the VMs.
Contents
3. Competitive Analysis
4. Key Technologies
FastWrite — Higher Dual-Write Performance
[Diagram: cross-site write sequence over FC or IP. Common solution: write command (RTT-1), ready, data transfer (RTT-2), status good. FastWrite: write command and data sent together in one transfer, then status good.]
• Common solution: a write I/O involves two interactions between the two storage systems - the write command and the data transfer - so one 100-km transmission costs two RTTs.
• FastWrite: the protocol is optimized to combine the write command and data transfer into one transmission, halving the number of cross-site write I/O interactions; one 100-km transmission link costs a single RTT.
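A back-of-envelope model of the claim, with illustrative numbers: roughly 1 ms round trip over ~100 km of fiber, and an assumed 0.1 ms of local processing. The figures are assumptions, not measurements.

```python
RTT_100KM_MS = 1.0                    # ~100 km of fiber, rounded

def write_latency(rtts: int, local_ms: float = 0.1) -> float:
    """Cross-site write latency as round trips plus local processing."""
    return rtts * RTT_100KM_MS + local_ms

print(write_latency(rtts=2))          # common solution: command, then data
print(write_latency(rtts=1))          # FastWrite: command + data combined
```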
Optimistic Lock Optimization (Write Process)
[Diagram: write flow with a distributed lock (latency = t1 + t2 + t3) vs. with an optimistic lock (latency = t1 + t3): the cross-site lock negotiation step t2 disappears]
HyperMetro Arbitration Design
[Diagram: storage systems A and B (with a preferred site) in a shared storage resource pool, plus active and standby quorum servers at a third-place site reachable over IP]
Arbitration design:
• Quorum servers are deployed at a third-place site in a different fault domain from the two active-active data centers; two quorum servers are supported to prevent single points of failure. They work in active-standby mode, with only one in effect at a time.
• The failure of a quorum server does not affect active-active services; the arbitration mode automatically switches to static priority mode.
• If there is no quorum server, the arbitration mode is configured as static priority. Note: in that case, the failure of the preferred site interrupts services.
• Quorum device: physical or virtual servers can be used as quorum devices. Quorum link: IP addresses must be reachable.
• Arbitration granularity: LUN pairs or consistency groups for SAN, and vStores for NAS.
• Compared with static priority mode, quorum server mode delivers higher reliability, ensuring service continuity in the event of a single point of failure; it is therefore the recommended mode.
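The fallback behavior described above, as a hedged sketch; the arbitrate function and its rules are a simplification of the two modes, not the product's actual state machine.

```python
def arbitrate(site_a_alive, site_b_alive, quorum_alive, preferred="A"):
    """Return which site(s) continue serving after a fault."""
    if quorum_alive:
        # Quorum server mode: surviving sites that reach the quorum keep serving.
        if site_a_alive and site_b_alive:
            return {"A", "B"}
        return {"A"} if site_a_alive else {"B"} if site_b_alive else set()
    # Static priority mode: only the preferred site may keep serving.
    return {preferred} if (preferred == "A" and site_a_alive
                           or preferred == "B" and site_b_alive) else set()

print(arbitrate(True, False, True))    # site B fails, quorum up -> {'A'}
print(arbitrate(True, True, False))    # quorum fails -> {'A'} (preferred site)
```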
Cross-Site Bad Block Repair
Working principle: if bad blocks on one storage system cannot be repaired locally, the data is read from the peer storage system and rewritten to repair them, without affecting host access to services.
Summary
Active-Active DR
Bring digital to every person, home, and organization
for a fully connected, intelligent world.
Huawei Converged Storage
Architecture and Technical Overview
1 Introduction to Converged Storage Products
Huawei Converged Storage Overview
[Diagram: product family: OceanStor 2200/2600 V3 and OceanStor 5300/5500/5600/5800 V5]
Multi-Level Convergence Makes Core Services More Agile
• Multiple storage types: interconnection between different types, levels, and generations of storage; pooling of heterogeneous storage resources; unified management and automated service orchestration.
• SAN and NAS: converged SAN and NAS resource pools supporting multiple service types with industry-leading performance and functions.
• SSD and HDD: HDDs and SSDs converged to meet the performance requirements of complex services.
• A-A for SAN and NAS: gateway-free converged DR solution (HyperMetro) with smooth upgrade to 3DC.
Multi-level convergence delivers 99.9999% service availability, satisfying complex service requirements.
1 Introduction to Converged Storage Products
OceanStor Converged Storage V5 Architecture: Comparison
• Huawei OceanStor V5: converged block and file services, with file-system and block services processed in parallel on a storage pool based on RAID 2.0+.
• SAN over NAS file-system architecture: converged block and file storage on a WAFL-based design with a unified file & block manager and physical RAID groups (RAID DP / RAID 4+).
• SAN plus NAS gateway architecture: a standalone NAS gateway whose storage pool consists of one or more LUNs mapped from SAN storage.
OceanStor V5 Software Architecture Overview
Convergence of SAN and NAS: How to Implement
[Diagram: disk domain (SSD / SAS / NL-SAS tiers) -> storage pool (CKGs as RAID groups, extents as tiering units, grains for thin provisioning, dedupe & compression) -> thick/thin LUNs and file systems with level-1 and level-2 caches -> iSCSI/FC block protocols and NFS/CIFS shares, with block-level and file-level tiering]
The OceanStor converged storage architecture minimizes the I/O paths of SAN and NAS and provides optimal performance.
OceanStor V5: Reliable Scale-out Storage
[Diagram: applications fail over between active-active controllers; CKGs span SSDs and HDDs]
Key Technology: RAID 2.0+ Architecture
[Diagram: hot-spare space distributed as chunks across all disks instead of dedicated spare drives]
More Reliable System with 20-Fold Faster Data Reconstruction
[Chart: with RAID 2.0+, data reconstruction time plummets from 10 hours under traditional RAID to 30 minutes, because chunks are rebuilt in parallel across all disks rather than onto a single hot spare]
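Rough arithmetic shows where the speed-up comes from: the rebuild is limited by how many disks can absorb the reconstructed writes. The disk count, capacity, and per-disk write rate below are illustrative assumptions, not measured values.

```python
def rebuild_hours(data_tb: float, per_disk_write_mbs: float, writer_disks: int) -> float:
    """Hours to rewrite data_tb of reconstructed data across writer_disks."""
    total_mb = data_tb * 1024 * 1024
    return total_mb / (per_disk_write_mbs * writer_disks) / 3600

print(rebuild_hours(4, 120, 1))     # traditional: one hot spare soaks all writes (~10 h)
print(rebuild_hours(4, 120, 24))    # RAID 2.0+: ~24 disks share the writes (~0.4 h)
```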
SmartMatrix 3.0 Overview
Key Technology: SmartMatrix 3.0 Front-End Interconnect I/O Module (FIM)
Key Technology: SmartMatrix 3.0 Persistent Cache
[Diagram: cache mirror pairs A/A*, B/B*, C/C*, D/D* redistributed across four controllers as failures occur]
If controller A fails, controller B takes over its cache, and controller B's cache (including controller A's) is mirrored to controller C or D. If controller D is also faulty, controller B or C mirrors its cache likewise.
Whenever a controller fails, services switch rapidly to its mirror controller, and the mirror relationships with the remaining controllers are re-established so that each controller keeps a cache mirror. Write-back caching (instead of write-through) therefore continues for service requests, preserving performance after the controller failure and ensuring system reliability, because data written into the cache always has mirror redundancy.
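A compact sketch of that takeover-and-re-mirror flow. The controller names and the min()-based partner choice are illustrative, not the real selection policy.

```python
class Cluster:
    def __init__(self, controllers):
        self.cache = {c: {c} for c in controllers}    # cache copies held per ctrl
        self.alive = set(controllers)

    def fail(self, dead):
        self.alive.discard(dead)
        partner = min(self.alive)                     # takeover controller
        self.cache[partner] |= self.cache.pop(dead)   # absorb the failed cache
        new_mirror = min(c for c in self.alive if c != partner)
        self.cache[new_mirror] |= self.cache[partner] # re-establish the mirror

cl = Cluster(["A", "B", "C", "D"])
cl.fail("A")                                          # B absorbs A, mirrors to C
print({c: sorted(v) for c, v in cl.cache.items()})    # every copy still mirrored
```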
Key Technology: Load Balancing of SmartMatrix 3.0 Persistent Cache
[Diagram: after a controller failure, cache mirror copies (C1/C1', D1/D1', A2/A2', ...) are spread evenly across the surviving controllers instead of piling onto one takeover controller]
Key Technology: HyperMetro (Block & File)
[Diagram: a host application cluster, with shared volumes mounted to HyperMetro file systems, accesses both sites over IP/FC; the production storage systems dual-write data, heartbeats, and configurations, with an IP-connected quorum site]
Working principles:
• One device: gateway-free - a single device uses HyperMetro to support both active-active file systems and databases.
• One quorum system: SAN and NAS share one quorum site, so services are provided by the same site in the event of link failures, ensuring data consistency.
• One network: the heartbeat, configuration, and physical links between the two sites are integrated into one link, and one network supports both SAN and NAS transmission.
Highlights:
• Active-active, RPO = 0, RTO ≈ 0.
• Requires no gateway devices, simplifying networks, saving costs, and eliminating gateway-caused latency.
• Supports two quorum servers, improving reliability.
• Supports flexible combinations of high-end, mid-range, and entry-level storage arrays in active-active solutions, saving investment.
• Supports smooth upgrade from active-active or active-passive solutions to 3DC solutions without service interruption.
• Flexibly supports 10GE or FC networks for intra-city interconnection and IP networks for quorum links.
1 Introduction to Converged Storage Products
OceanStor Converged Storage Features
SAN features: iSCSI, FC
OceanStor Converged Storage Features
NAS Features
SmartMulti-Tenant Architecture: Network, Protocol, and Resource Virtualization
[Diagram: protocol virtualization: each vStore has its own file systems (FS 0-FS 3), VFS, lock manager, and directory services (AD, LDAP/NIS) serving its own clients]
HyperMetro for NAS
[Diagram: vStore pairs (vStore1/vStore1', vStore2/vStore2', vStore3/vStore3') with HyperMetro file-system pairs (FS1/FS1' through FS6/FS6') synchronizing data and configuration between sites A and B over IP/FC, with an IP-connected quorum server]
Working principles:
• High-availability synchronous mirroring at the file-system level: when data is written to the primary file system, it is synchronously replicated to the secondary file system. If the primary site or file system fails, the secondary site or file system automatically takes over services without any data loss or application disruption.
Highlights:
• Gateway-free deployment
• 1 network type between sites
• 2 components required for smooth upgrade
• 3 automatic fault recovery scenarios
• 4x scalability
• 5x switching speed
HyperMetro Architecture
[Diagram: clients and LDAP/AD/NIS servers in front of paired file systems FS 0-FS 2 and FS' 0-FS' 2]
Service and Access Processes in the Normal State
[Diagram: steps 1-7 flow from the administrator through the NAS service, CCDB, and configuration sync on vStore A to vStore A'; steps 8-18 flow from the NAS client through the file system, object set, data sync, and cache to concurrent writes on both storage systems]
Steps 1-7: Configurations made by the administrator on vStore A are synchronized to vStore A' in real time - quotas, qtrees, the NFS and CIFS services, security strategies, users and user groups, user mappings, shares and share permissions, DNS, AD domain, LDAP, and NIS. If a failure occurs, the changed configurations are saved in the CCDB log and the vStore pair status is set to "to be synchronized"; after the link recovers, the configurations in the CCDB log are automatically synchronized to vStore A'.
Steps 8-18: When a NAS share is mounted to the client, the storage system checks the access permission of the share path against the client IP address; if the network group or host name has permission, the client obtains the handle of the shared directory. When a user writes data into a file, the NAS service processes the request and converts it into a read/write request of the file system. For a write request, the data synchronization module writes the data to the caches of both sites simultaneously and then returns the execution result to the client.
Service and Access Processes During a Failover
[Diagram: after vStore A fails, the NAS service, file system, and LIF on vStore A' take over at the same IP address (10.10.10.10)]
Steps 1-8: When vStore A is faulty, vStore A' detects the faulty pair status and applies to the quorum server for arbitration. After obtaining the arbitration, vStore A' activates its file system, NAS service, and LIF status. NAS service configuration differences are recorded in the CCDB log, and data differences are recorded in the data change log (DCL), so that vStore A can synchronize incremental configurations and data upon recovery.
The CCDB log and DCL are protected against power failure and deliver high performance.
NFS Lock Failover Process
[Diagram: a client with NFS mount 10.10.10.11:/fs1 across the three lock-failover phases]
1. Synchronize the client's IP address pair: HyperMetro backs up the client's IP address pair to the remote storage via configuration synchronization into the CCDB.
2. Notify the client: the NAS storage reads the list of IP address pairs from the CCDB and sends NOTIFY packets to all clients so they reclaim their locks.
3. The client reclaims the lock: the client sends a lock-reclaiming command to the storage, and the storage recovers the byte-range locks.
NAS HyperMetro: FastWrite
[Diagram: as for SAN, the write command and data are combined into one cross-site transfer (a single RTT) before "status good" returns]
HyperSnap
Copy-on-write (COW), used by LUNs of OceanStor V5 storage:
• Before snapshot creation: the active volume holds blocks A B C D.
• During snapshot creation: a snapshot mapping table is created that points at the same blocks.
• When the host modifies block D after the snapshot: D is copied aside for the snapshot, the mapping table is modified, and the new data D1 is written.
[Diagram: the active volume and snapshot mapping tables across the three stages, distinguishing deleted, modified, and new data]
HyperVault
Working principles:
• The initial HyperVault backup is a full backup; subsequent backups are incremental.
• Because HyperVault operates on file systems, the backups are completely transparent to hosts and applications.
• Each copy in the backup file system contains the full service data, not only the incremental data.
• Data in the backup file system is stored in the original format and is readable as soon as the backup completes.
DR Star (SAN)
[Diagram: DC 1 and DC 2 run active-active LUN-A/LUN-B; both hold asynchronous replication relationships to LUN-C in DC 3, one active and one standby, with standby hosts at DC 2 and DC 3]
I/O process:
1. The host delivers I/Os to the primary LUN-A.
2. The primary site dual-writes the I/Os to the secondary LUN-B.
3. A write success is returned to the host.
4. Asynchronous replication starts and triggers LUN-A to activate time slice Ta+1. New data written to LUN-A is stored in this slice, and the Ta slice is used as the data source for the standby asynchronous replication.
5. LUN-B activates a new time slice Tb+1, where new data is stored, and LUN-C activates time slice Tc+1 as the target of asynchronous replication; Tc is the protection point for replication rollback.
6. LUN-B (Tb) is the data source for asynchronous replication to LUN-C (Tc+1). Because data in DC1 and DC2 is synchronous, copying Tb to Tc+1 also carries the data of Ta - equivalent to asynchronous replication between DC1 and DC3. If DC2 is faulty, DC1 and DC3 switch to asynchronous replication, and only incremental data from Ta is replicated to DC3.
Compared with the common 3DC solution:
1. A replication relationship exists between every two sites; only one of the two asynchronous replication relationships carries I/O at a time, while the other stands by.
2. If the working asynchronous replication link is faulty, or one of the active-active sites is switched over, replication switches to the standby link and incremental synchronization continues.
3. DR Star only needs to be configured at one site.
4. DR Star supports active-active + asynchronous + standby and synchronous + asynchronous + standby networking modes; asynchronous + asynchronous + standby is not supported.
| Item | Huawei | H** | E** |
| Active-active + asynchronous remote replication | Supported | Supported | Not supported |
| Synchronous + asynchronous remote replication | Supported | Not supported | Supported |
| Configured at one site | Supported | Not supported | Supported |
SmartTier (Intelligent Tiering)
Block-level tiering (LUN): data is relocated across Tier 0 (SSD), Tier 1 (SAS), and Tier 2 (NL-SAS) based on rank and the relocation policy.
File-level tiering (file system): a user-defined file write policy and relocation policy select files by attributes - file size, name, type, atime, ctime, crtime, and mtime. File distribution analysis scans for the list of files matching the policy, and file relocation moves them between the SSD tier (Tier 0) and the SAS/NL-SAS tier (Tier 1).
SmartTier for NAS (Intelligent File Tiering)
[Diagram: file-system tiering policy highlights]
SmartTier File Relocation Principles
SmartTier policy: files matching the policy are added to the background relocation task.
Relocation period: specifies the start time and the running duration, and can be paused. A sketch of such a policy filter follows.
SmartTier for NAS and Background Deduplication & Compression
Configure SmartTier to improve performance and save space:
Enable SmartTier for the file system and configure the automatic relocation mode in which all data is first written to the performance tier (SSD tier).
Set the SmartTier relocation time window, for example from 22:00 to 05:00.
In SmartTier, enable deduplication and compression during relocation.
[Figure: timeline on the performance tier (SSD). A file system is created and new data is written to SSDs; data is deduplicated and compressed when relocated to HDDs; after deduplication and compression complete, new data continues to be written to SSDs.]
THANK YOU
2019/9
Objectives
Upon completion of this course, you will be able to understand OceanStor Dorado V6's key reliability
features and their technical principles.
2 Module-Level Reliability
3 System-Level Reliability
4 Solution-Level Reliability
5 O&M Reliability
Level 1: Module-level hardware reliability
Scope: component, device, environment, production, and disk reliability. Lean manufacturing and processing ensure the yield rate.
2 Module-Level Reliability
Media reliability
• Reliability of Huawei-developed chips: error detecting and error handling.
• HSSD reliability: backup power, ECC/CRC reliability, BIST, load balancing algorithm, error correction algorithm, RAID, bad block management, and protocol data integrity.
• Soft failure prevention: reliable upgrade, reliable expansion, and data storage redundancy.
• Running fault detection and self-healing: key hardware signals and board-level monitoring.

• Die-level multi-copy & RAID: metadata is protected by multiple copies and user data by intra-disk RAID.
• Data restoration: LDPC, read retry, and intra-disk XOR enable data restoration using redundancy.
• Wear leveling: periodically moves data blocks so that blocks with less wear can be used again.
Inspection and isolation:
1. Background inspection: combines read inspection and write inspection and proactively reports bad blocks detected during inspection.
2. Bad block isolation: detects, migrates, and isolates bad blocks.

Self-healing:
1. Online self-healing: restores a disk to its factory settings online.
2. Die failure: active reporting and capacity reduction.
3. Power failure protection: dirty data is flushed using backup capacitor power upon a power failure.

Conventional SSD: silent data corruption (D3) plus a single-disk fault leads to data loss. HSSD: bad block (die) self-recovery through intra-HSSD RAID plus a single-disk fault leads to no data loss. [Figure: CKG1 stripe D0 D1 D2 D3 P across disks 1-6, with an intra-HSSD RAID data column across dies within disk 4.]

Without RAID: silent data corruption may occur on disks (HDDs may have bad sectors and SSDs may have bad blocks, on which data is unavailable). If such bad sectors or blocks are seldom accessed, the corruption cannot be detected or rectified in time. Once a disk fails, data in these bad sectors or blocks cannot participate in reconstruction, resulting in loss of user data.

With RAID: HSSDs periodically scan data blocks and restore detected bad blocks using intra-disk RAID. In addition, data in bad blocks can be restored in real time using inter-disk RAID when the blocks are accessed by a host or participate in data reconstruction. In this way, data is not lost.
3 System-Level Reliability
E2E Redundancy Design
[Figure: host connected through front-end interface modules to two controller engines (software instances on controllers A-D in each), inter-enclosure exchange interface modules, and back-end RAID disk enclosures.]
1. Services are not interrupted if a single controller is faulty: front-end interconnect I/O modules (FIMs), protocol offloading, and controller failover within seconds.
2. Services are not interrupted if multiple controllers are faulty (HyperMetro-Inner): three cache copies, continuous mirroring, cross-engine mirroring, and a full-mesh back end.
3. Services are not affected in the case of a software fault: process availability detection, startup of processes within seconds upon a fault, and intermittent isolation of frequently abnormal background tasks.
4. Services are not interrupted if multiple disks are faulty: EC-2/EC-3 protection for user data.
5. Controllers do not reset if interface modules for inter-enclosure exchange are faulty: multiple interface modules provide redundancy on high-end storage, and TCP forwarding covers mid-range and entry-level storage with a single interface module.
Controller Failover I/O Process
[Figure: controller enclosure A with front-end interface modules FIM 1 and FIM 2, controllers A-D, back-end interface modules BIM 1 and BIM 2, and a RAID-protected disk enclosure.]
1. The host delivers I/Os to controller A: if all controllers are normal, I/Os are delivered to controller A through FIM 1.
2. Controller A becomes faulty: FIM 1 and controller B detect that controller A is unavailable by means of interrupts.
3. Service switchover: services are switched (within 1 s) to controller B, which holds the data copies of controller A, by switching the vNode. The FIMs are then instructed to refresh the distribution view.
4. I/O path switchover: FIM 1 returns BUSY for the I/Os that were already delivered to controller A. Retried and new I/Os from the host are delivered to controller B based on the new view.
RAID ...
2* 1* 4* 3* 1 4* 3* 1 4
3* 1*
1* 2*
4* 2*
Controller A Controller
Controller B Controller C
B Controller C Controller
Controller D
D Controller AA Controller
Controller Controller B
B Controller C Controller
Controller C Controller D
D Controller A Controller B Controller
Controller A C Controller
Controller C Controller D
D
Normal Failure of one controller (controller A) Failure of one more controller (controller D)
Continuous mirroring (ensuring service continuity even when seven out of eight controllers are faulty): If controller A is faulty,
controller B selects controller C or D as the cache mirror. If controller D fails at the moment, cache mirroring is implemented between
controller B and controller C to ensure dual-copy redundancy.
Service continuity: If a controller fails, its mirror controller establishes a mirror relationship with another functional controller within 5
minutes. This design increases the service availability by at one nine and ensures service continuity in the event that multiple
controllers fail successively.
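A minimal sketch of the re-pairing rule, under the assumed semantics of the description above (when a mirror breaks, the survivor picks any other healthy controller):

# Sketch of continuous cache mirroring across controllers A-D (illustrative).
def next_mirror(pairs, healthy, failed):
    """Re-pair the survivor of a broken mirror with another healthy controller."""
    healthy.discard(failed)
    new_pairs = []
    for a, b in pairs:
        if failed in (a, b):
            survivor = b if a == failed else a
            candidates = healthy - {survivor} - {c for p in new_pairs for c in p}
            if candidates:
                new_pairs.append((survivor, sorted(candidates)[0]))
        else:
            new_pairs.append((a, b))
    return new_pairs

pairs = [("A", "B"), ("C", "D")]
healthy = {"A", "B", "C", "D"}
pairs = next_mirror(pairs, healthy, "A")   # B re-pairs with C
pairs = next_mirror(pairs, healthy, "D")   # B and C now mirror each other
print(pairs)                               # -> [('B', 'C')]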
[Figure: four engines, each with a shared front end, controllers A-D, and a shared back end; smart disk enclosures attach below.]
The global cache provides three cache copies across controller enclosures. If two controllers fail simultaneously, at least one cache copy remains available; a single controller enclosure can tolerate the simultaneous failure of two controllers with the three-copy mechanism. A smart disk enclosure connects to 8 controllers (in 2 controller enclosures) through BIMs, so even if an entire controller enclosure fails, at least one cache copy is still available.
Burst Quota
• Token accumulation: if the performance of a LUN, snapshot, LUN group, or host stays below the upper threshold for a second, one second of burst duration is accumulated. When service pressure suddenly increases, performance may exceed the upper limit and reach the burst traffic; the accumulated tokens are consumed by the bursting objects and last at most the configured duration. In this way, the system responds to burst traffic in time (a token-accounting sketch follows below).

Lower Limit Guarantee
• Minimum traffic: each LUN is configured with a minimum traffic (IOPS/bandwidth) by default, and this minimum must be ensured even when the system is overloaded.
• Traffic suppression for high-load LUNs: when the system is overloaded and the traffic of some LUNs does not reach the lower limit, the system performs load rating on all LUNs. Medium- and low-load LUNs are given looser traffic conditions based on their load status, while high-load LUNs are throttled until the system frees enough resources for all LUNs to reach their lower limits.

[Figure: LUNs whose traffic does not reach the lower limit: no suppression. LUNs whose traffic reaches the lower limit: burst prevention. LUNs whose traffic far exceeds the lower limit: traffic suppression.]
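The burst mechanism reads like a token bucket with credit accumulation. A hedged Python sketch follows; the parameter names are mine, not product settings.

# Sketch of burst-quota accounting: one second under the upper limit earns
# one second of burst credit, up to a configured maximum duration.
class BurstQuota:
    def __init__(self, upper_limit_iops, burst_limit_iops, max_burst_seconds):
        self.upper = upper_limit_iops
        self.burst = burst_limit_iops
        self.max_credit = max_burst_seconds
        self.credit = 0.0

    def allowed_iops(self, demand_iops):
        """Called once per second; returns the IOPS granted for that second."""
        if demand_iops < self.upper:
            self.credit = min(self.max_credit, self.credit + 1)  # accumulate a token
            return demand_iops
        if self.credit >= 1:
            self.credit -= 1                                     # spend a token to burst
            return min(demand_iops, self.burst)
        return self.upper                                        # throttle to the upper limit

q = BurstQuota(upper_limit_iops=10_000, burst_limit_iops=15_000, max_burst_seconds=60)
for demand in [4_000] * 30 + [15_000] * 5:
    q.allowed_iops(demand)
print(round(q.credit))  # 30 seconds earned, 5 spent -> 25 seconds of burst credit left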
E2E Data Protection
[Figure: host data copies C1-C8 spread across controllers A-D of controller enclosures A and B; RAID stripes across disks; die-level RAID within each disk.]
1. Cache data redundancy: two or three copies of cache data ensure no data loss when multiple controllers or a single controller enclosure fails.
2. Disk data redundancy: RAID 2.0+ ensures that user data on disks is not lost when multiple disks fail consecutively or simultaneously. Data reconstruction is offloaded to smart disk enclosures, further improving data reliability.
3. Intra-disk data redundancy: RAID 4 provides die-level redundancy within a disk, preventing user data loss in the case of bad blocks or die failures.
4. Redundancy maintained even when RAID disks are insufficient: dynamic reconstruction, which involves fewer data disks (for example, 22+2 shrinking to 21+2 after a disk fault), maintains redundancy when the number of member disks no longer meets RAID requirements.
5. E2E data consistency: E2E PI and parent-child hierarchy verification ensure that data on I/O paths is not damaged.
Cache Copies
[Figure: three cache copies, with the third copy saved in another controller enclosure (controller enclosure 0 and controller enclosure 1).]
Data is not lost if two controllers are faulty: three copies of cache data are supported. For host data with the same LBA, the system creates a pair of cache data copies on two controllers and a third copy on another controller.
Data is not lost if a controller enclosure is faulty: when the system has two or more controller enclosures, the three cache copies are placed on controllers in different controller enclosures, so cache data survives the failure of an entire controller enclosure (containing four controllers). A placement sketch follows below.
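A minimal sketch of the enclosure-aware placement rule implied above (two copies on two controllers, the third on a controller in another enclosure where one exists); the controller naming is invented for illustration.

# Sketch: choose three cache-copy owners so the third copy lands in another enclosure.
ENCLOSURES = {"enc0": ["A0", "B0", "C0", "D0"], "enc1": ["A1", "B1", "C1", "D1"]}

def place_copies(home_controller):
    home_enc = next(e for e, ctrls in ENCLOSURES.items() if home_controller in ctrls)
    peers = [c for c in ENCLOSURES[home_enc] if c != home_controller]
    other_encs = [e for e in ENCLOSURES if e != home_enc]
    if other_encs:  # multi-enclosure system: third copy crosses the enclosure boundary
        third = ENCLOSURES[other_encs[0]][0]
    else:           # single enclosure: third copy on yet another local controller
        third = peers[1]
    return [home_controller, peers[0], third]

print(place_copies("B0"))  # -> ['B0', 'A0', 'A1']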
Conventional RAID vs. Block-Virtualization RAID (RAID 2.0+)
[Figure: a conventional LUN built on RAID group 1/RAID group 2 with a dedicated hot spare disk, versus block-virtualization RAID in which CKGs are distributed across many disks and rebuilt into spare space on all of them.]
Reconstruction using conventional RAID: during reconstruction, data is read from the remaining functional disks and reconstructed, and the result is written to a hot spare disk or a new disk. The write performance of that single disk restricts reconstruction, so reconstruction takes a long time.
Reconstruction using RAID 2.0+: RAID 2.0+ supports dozens of member disks. When a disk fails, the other disks all participate in reconstruction reads and writes, greatly shortening the reconstruction time; as more disks share the reconstruction load, the load on each disk significantly decreases.
Reconstruction Offloading
Controller-based reconstruction (flow: 1. the controller initiates a disk read request; 2. it reads 23 blocks from the disk enclosures; 3. it calculates the faulty blocks; 4. it writes the faulty blocks into the hot spare space):
1. Reconstruction occupies controller computing (CPU) resources: if one or more disks are faulty, all data is computed on the controller, overloading the controller CPU and adversely affecting host I/O processing.
2. Reconstruction occupies massive data write bandwidth: all data on the disks in the RAID group is read to the controller for computing, occupying data write bandwidth and reducing host I/O write bandwidth.

Reconstruction offloading (2x better performance; flow: 1.1/1.2 the controller initiates reconstruction tasks to both smart disk enclosures; 2.1 each enclosure initiates local disk reads; 3.1/3.2 each reads 12 blocks; 4.1 each enclosure's compute module calculates its partial parities P', Q' and P'', Q''; 5. the controller uses P', Q', P'', and Q'' to calculate D1 and Dx'; 6. the faulty blocks are written into the hot spare space):
1. The computing of RAID member disks is offloaded to smart disk enclosures: smart disk enclosures have idle CPU resources, and the disk recovery data read by the IP enclosure is reduced inside the enclosure to the partial results P' and Q'.
2. Reconstruction occupies little data write bandwidth: data involved in RAID recovery is computed in the smart disk enclosure and does not need to be transmitted to the controller, reducing back-end bandwidth occupation and the impact of reconstruction on system performance.
Dynamic Reconstruction
[Figure: normal RAID (4+2) across disks 1-6; a RAID member disk fails; with insufficient disks, the stripe is rebuilt as RAID (3+2), recomputing parity (P', Q') over the remaining data columns (D0+D2+D3 = P'+Q').]
For a RAID group with M+N members (M data columns and N parity columns):
Common reconstruction: when a disk fails, the system uses an idle CK to replace the faulty one and restores data to it. If the number of member disks in the disk domain is fewer than M+N, two CKs end up on the same disk, decreasing the RAID redundancy level.
Dynamic reconstruction: if the number of member disks in the disk domain is fewer than M+N, the system reduces the number of data columns (M) and retains the number of parity columns (N) during reconstruction. This preserves the RAID protection level and system reliability (a sizing sketch follows below).
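A worked sketch of the column-reduction rule (keep the N parity columns, shrink the M data columns to fit the surviving disks):

# Sketch of dynamic reconstruction sizing: keep parity count N, reduce data count M.
def dynamic_raid(m, n, usable_disks):
    """Return the (M, N) layout used after a disk failure."""
    if usable_disks >= m + n:
        return m, n                      # enough disks: normal reconstruction
    return max(usable_disks - n, 1), n   # insufficient disks: shrink M, keep N

print(dynamic_raid(4, 2, 6))    # (4, 2): all member disks present
print(dynamic_raid(4, 2, 5))    # (3, 2): one disk lost, redundancy level preserved
print(dynamic_raid(22, 2, 23))  # (21, 2): matches the 22+2 -> 21+2 example above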
E2E Data Protection (PI and Checksum Verification)
[Figure: on the write path, host data (8 KB = 16 sectors of 512 B + 8 B PI) gets PI inserted at the front end and verified at each hop. Protection point 1: data PI verification. Protection point 2: checksum verification between data and metadata. Protection point 3: metadata CRC. Protection point 4: metadata DIF. Protection point 5: parent-child hierarchy verification of metadata.]

Data protection at hardware boundaries: data verification is performed at multiple key nodes on I/O paths within the storage system, including front-end chips, the controller software front end, the controller software back end, and back-end chips.

Multi-level software-based data protection: for each 512-byte sector, in addition to the 8-byte PI (two bytes of which are CRC bytes), the system extracts the CRC bytes of the 16 PI sectors in an 8 KB block to form a checksum and stores it in the metadata node. If skew occurs in one or more (512+8)-byte sectors, the checksum changes and becomes inconsistent with the value saved in the metadata node. When the system reads the data and detects the inconsistency, it uses RAID redundancy data on other disks to recover the erroneous data, preventing data loss.

Metadata protection: metadata is organized in a tree structure, and a parent metadata node stores the CRC values of its child nodes, similar to the relationship between data and metadata. Once metadata is damaged, it can be verified and restored using the parent and child nodes (a checksum sketch follows below).
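To make the sector-PI and checksum relationship concrete, here is a hedged Python sketch using CRC-32 as a stand-in for the PI CRC; the real field widths and polynomial are not specified in the slide.

import zlib

SECTOR = 512  # an 8 KB block = 16 sectors, each carrying an 8-byte PI with CRC bytes

def sector_crcs(block_8k):
    """Per-sector CRCs standing in for the CRC bytes inside each PI field."""
    return [zlib.crc32(block_8k[i:i + SECTOR]) for i in range(0, len(block_8k), SECTOR)]

def block_checksum(crcs):
    """Checksum formed from the 16 per-sector CRCs, stored in the metadata node."""
    joined = b"".join(c.to_bytes(4, "big") for c in crcs)
    return zlib.crc32(joined)

data = bytes(8192)
meta = block_checksum(sector_crcs(data))

corrupted = bytearray(data)
corrupted[100] ^= 0xFF                       # silent corruption in sector 0
if block_checksum(sector_crcs(bytes(corrupted))) != meta:
    print("inconsistency detected -> recover from RAID redundancy")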
Protection (Example)
Write path: PI is inserted for host data at the front end. The service layer calculates the data CRC and compares it with the existing PI; if they are inconsistent, the data is damaged, and the system restores it using RAID before returning success to the host. The data layer verifies PI and marks the data as incorrect if verification fails.
Read path: the system reads data and verifies PI. If verification fails, the on-disk data is considered damaged; the system restores it using RAID and returns success to the upper layer.
Disk management functions:
• Operation management: operation records, environment monitoring, hibernation, and firmware management
• Access optimization: access model optimization, service flow control and balancing, separation of user data and metadata, and wear leveling/anti-wear leveling
• Diagnosis and prediction: health evaluation, bad sector/block scanning and repair, and online diagnosis
Global Wear Leveling and Anti-Wear Leveling
Wear leveling: SSDs can withstand only a limited number of program/erase cycles. The system evenly distributes workloads across all SSDs, preventing individual disks from failing early due to continuously frequent access. [Figure: wear of SSD0-SSD9 equalized at a similar level.]
Anti-wear leveling: to prevent simultaneous multi-SSD failures, the system starts anti-(global) wear leveling when it detects that SSD wear has reached the threshold. Data is then deliberately distributed unevenly so that the SSDs' wear degrees differ by at least 2% (a stagger sketch follows below). [Figure: wear of SSD0-SSD9 staggered between roughly 60% and 90%.]
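A small sketch of the anti-wear-leveling trigger and the 2% stagger check; the 80% trigger level is read off the slide's figure, and the skew policy itself is illustrative.

# Sketch: once wear crosses a threshold, enforce a wear stagger of >= 2%
# so that SSDs do not all reach end-of-life simultaneously.
THRESHOLD = 0.80   # trigger level suggested by the slide's 80-90% figures
MIN_GAP = 0.02

def plan_wear_targets(wear):
    """Given current wear ratios, return staggered targets once the threshold is hit."""
    if max(wear) < THRESHOLD:
        return sorted(wear)                      # keep plain wear leveling
    base = min(wear)
    return [round(base + i * MIN_GAP, 2) for i in range(len(wear))]

ssds = [0.81, 0.80, 0.82, 0.80]
print(plan_wear_targets(ssds))  # -> [0.8, 0.82, 0.84, 0.86]: neighbors differ by 2%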
Slow Disk Detection and Isolation (IP/FC SAN)
[Figure: a RAID/CKG stripe (0, 1, 2, P0 / 3, 4, P1, 5) whose blocks on a slow disk (3', 4', P1') are redirected elsewhere; the abnormal disk is separated from the normal ones.]
Technical highlights:
• Disk vulnerabilities are detected and removed as early as possible to minimize the system risk of disk failure.
• When all disks in the system become slow, the response time of any single disk is not much greater than that of the others; in this case no disk is isolated, reducing false isolation.
• Slow disks are identified and isolated only after diagnosis and repair.
4 Solution-Level Reliability
[Figure: solution-level data protection portfolio, including CloudBackup.]
HyperSnap
Overview
• HyperSnap quickly captures online data and generates snapshots of source data at specified points in time (for example, 08:00, 12:00, 16:00, and 20:00) without interrupting system services, preventing data loss caused by viruses or misoperations. Snapshots can be used for backup and testing.
Highlights
• The innovative multi-time-segment cache technology can continuously activate snapshots at intervals of several seconds. Activating snapshots does not block host I/Os, and host services remain responsive.
• Based on the RAID 2.0+ virtualization architecture, the system flexibly allocates storage space for snapshots, making dedicated resource pools unnecessary.
HyperClone
Overview
HyperClone generates a complete physical copy (target LUN) of the production LUN (source LUN) at a point in time. The copy can be used for backup, testing, and data analysis.
Highlights
• A complete, consistent physical copy
• Isolation of source and target LUNs, eliminating mutual performance impact
• Consistency groups, enabling consistent splitting of multiple LUNs
• Incremental synchronization
• Reverse incremental synchronization (from the target LUN to the source LUN)
HyperMetro
[Figure: two production storage systems with real-time data synchronization over IP networks and an optional quorum device.]
Flexible scalability design: SmartVirtualization, HyperSnap, and HyperReplication are supported, and HyperMetro can be expanded to the Disaster Recovery Data Center Solution (geo-redundant mode).
HyperReplication
Overview
HyperReplication supports both synchronous and asynchronous remote replication between storage systems. It is used in disaster recovery solutions to provide intra-city and remote data protection, preventing data loss caused by disasters and improving business continuity.
[Figure: the production center replicates LUN A to the intra-city DR center (A') synchronously or asynchronously, and to a remote DR center C (A'') via asynchronous replication.]
Highlights
1. The intra-city DR center undertakes the remote replication tasks, minimizing the impact on services in the production center.
2. If the storage system in the production center malfunctions, the intra-city or remote DR center can quickly take over services; the intra-city DR center keeps a data replication relationship with the remote DR center.
5 O&M Reliability
Upgrade Transparent to Hosts
Host unaware of the upgrade: each controller has an I/O holding module, which holds in-flight I/Os while a component is upgraded and restarted. After the upgrade, components continue to process the held I/Os, so hosts do not detect any connection interruption or I/O exception.
Component upgrade: the system upgrade is divided into two phases. Software components (processes) with redundant units are upgraded first; after the software packages are uploaded and the processes restarted, the second phase is triggered.
Zero performance loss: each software component restarts within 1 s. The front-end interface module returns BUSY for I/Os that fail during the upgrade; the host re-delivers them, and performance is restored to 100% within 2 seconds.
Short upgrade duration: no host compatibility issues are involved and host information does not need to be collected for evaluation. The entire storage system can be upgraded within 10 minutes because controllers do not need to be restarted.
[Figure: controller enclosure with front-end interface modules, an I/O holding module per controller (A-D), components 1..N upgraded in phase 1 and phase 2, and a fast-restart path to the disk enclosure.]
• Verify enterprise data center models for over 600 days.
• Optimize idle resources to improve resource utilization.
• Evaluate capacity requirements and provide detailed expansion solutions.
• O&M Reliability: Fast Upgrade
Copyright © 2020 Huawei Technologies Co., Ltd. All rights reserved.
All logos and images displayed in this document are the sole property of their respective copyright holders. No endorsement, partnership, or affiliation is suggested or implied. The information in this document may contain predictive statements including, without limitation, statements regarding future financial and operating results, the future product portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive statements. Therefore, such information is provided for reference purposes only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time without notice.
Department:
Prepared By:
Date: Feb., 2020
Typical Application Scenarios of Distributed Storage
Virtual storage pool (cloud storage resource pool for compute nodes):
• Storage space pooling and on-demand expansion, reducing initial investment
• Ultra-large VM deployment
• Zero data migration and simple O&M
• TCO reduced by 40%+ compared with the traditional solution
HDFS (storage-compute separation):
• Storage-compute separation, implementing elastic compute and storage scaling
• EC technology, achieving higher utilization than the traditional Hadoop three-copy approach
• Server-storage convergence, reducing TCO by 40%+
HPC/Backup/Archival (file storage for VMs, applications, and database services):
• HPC requirements for high bandwidth and intensive IOPS are met; cold and hot data can be stored together
• Backup and archival with on-demand purchase and elastic expansion
Core Requirements for Storage Systems
[Figure: users' core requirements, led by performance and usability (36.4% shown for the top item).]
Current Distributed Storage Architecture Constraints
Mainstream open-source software: user LUNs or files are mapped to multiple continuous objects on the local OSD, and three steps are needed to query the mapping (object to placement group (PG) to OSD via the CRUSH map). PGs have a great impact on system performance and layout balancing; they must be dynamically split and adjusted as the cluster scales, and this adjustment affects system stability.

Comparison of mainstream open-source and commercial software:
Performance. Open source: the CRUSH algorithm is simple but does not support RDMA, and performance for small I/Os and EC is poor. Commercial: multiple disks form a disk group (DG) with one SSD cache per DG; heavy I/O loads trigger cache writeback, greatly deteriorating performance.
Reliability. Open source: design-level CRUSH constraints cause uneven data distribution, uneven disk space usage, and insufficient subhealth handling; data reconstruction in fault scenarios heavily impacts performance.
Scalability. Open source: restricted by the CRUSH algorithm, adding nodes is costly and large-scale expansion is difficult. Commercial: poor scalability, with only up to 64 nodes supported.
Usability. Open source: lacks cluster management and maintenance interfaces, with poor usability. Commercial: inherits the vCenter management system, with high usability.
Cost. Open source: a low-cost community product, but EC is not commercially usable and deduplication and compression are not supported. Commercial: high cost; all-flash configurations support intra-DG (non-global) deduplication but no compression, and EC supports only 3+1 or 4+2.
Overall Architecture of Distributed Storage (Four-in-One)
Access protocols: VBS (SCSI/iSCSI), NFS/SMB, HDFS, and S3/SWIFT.
Services: Block (LUN, volume, DirectIO, L1 cache), File (L1 cache, MDS), HDFS (NameNode), and Object (LS, OSC, billing); disaster recovery: HyperReplication and HyperMetro; O&M plane: cluster management and DeviceManager.
Hardware: x86 and Kunpeng platforms.
Architecture advantages:
• Convergence of block, object, NAS, and HDFS services, enabling data flow and service interworking
• Introduces the strengths of professional storage to achieve an optimal balance of performance, reliability, usability, cost, and scalability
• Convergence of software and hardware, with performance and reliability optimized on customized hardware
OceanStor 100D (Block+File+Object+HDFS): Overview | Ultimate Performance | Ultimate Usability | Scenario and Ecosystem
Data Routing, Twice Data Dispersion, and Load Balancing Among Storage Nodes
First dispersion (front-end module): user data arriving at the front-end NIC is split into SLICEs and distributed across nodes N1-N7 on a DHT loop.
Second dispersion (data storage module): the data processing module (vnodes) on each node writes to Plogs (Plog 1, Plog 2, Plog 3, ...), which are spread over partitions on the SSDs/HDDs of nodes Node-0 to Node-3.
Basic concepts:
• Node: a physical node, that is, a storage server.
• Vnode: a logical processing unit. Each physical node is divided into four logical processing units. When a physical node becomes faulty, the services processed by its four vnodes can be taken over by four other physical nodes in the cluster, improving takeover efficiency and load balancing.
• Partition: a fixed number of partitions are created in the storage resource pool. Partitions are also the units of capacity expansion, data migration, and data reconstruction.
• Plog: a partition log for data storage, providing the read/write interface for append-only services. The size of a Plog is not fixed; it can be 4 MB or 32 MB, with a maximum of 4 GB. The Plog size, redundancy policy, and the partition where data is stored are specified at service creation. When a Plog is created, a partition is selected based on load balancing and capacity balancing.
I/O Stack Processing Framework
The I/O path is divided into two phases:
• Host I/O processing (1, red line): after receiving data, the storage system stores one copy in the RAM cache and three copies in the SSD WAL caches of three storage nodes. Once the copies are stored, a success response is returned to the upper-layer application host, completing host I/O processing.
• Background I/O processing (2, blue line): when data in the RAM cache reaches a certain amount, the system computes EC over large data blocks and stores the generated data and parity fragments onto HDDs. (3: before data is stored onto HDDs, the system decides, based on the data block size, whether to send blocks to the SSD cache.)
[Figure: RAM cache feeding erasure coding, SSD WAL cache, SSD cache, and HDDs.]
Storage Resource Pool Principles and Elastic EC
EC (erasure coding) is a data protection mechanism that implements redundancy by computing parity fragments. Compared with the multi-copy mechanism, EC has higher storage utilization and significantly reduces costs.
EC redundancy levels: N+M up to 24 (M = 2 or 4) or up to 23 (M = 3).
[Figure: partition table mapping partitions to disks across nodes, for example Partition 2 covering node1:Disk10, node3:Disk11, node1:Disk1, node3:Disk5, node2:Disk3, node2:Disk2.]
Basic principles:
1. When a storage pool is created from disks on multiple nodes, a partition table is generated. The number of rows in the partition table is 51200 x 3/(N+M), and the number of columns is (N+M). All disks are placed into the partition table as elements according to the reliability and partition-balancing principles (a worked example of the table dimensions follows below).
2. The mapping between partitions and disks is many-to-many: a disk may belong to multiple partitions, and a partition has multiple disks.
3. Partition balancing principle: each disk appears in the partition table approximately the same number of times.
4. Reliability balancing principle: for node-level security, the number of disks from the same node in a partition cannot exceed the value of M in EC N+M.
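Worked example of the sizing formula above (rows = 51200 x 3 / (N+M), columns = N+M):

# Worked example of the partition table dimensions for a few EC schemes.
def partition_table_shape(n, m):
    width = n + m
    rows = 51200 * 3 // width     # rows = 51200 x 3 / (N+M), per the formula above
    return rows, width

for n, m in [(4, 2), (8, 2), (22, 2)]:
    print(f"EC {n}+{m}: {partition_table_shape(n, m)}")
# EC 4+2: (25600, 6); EC 8+2: (15360, 10); EC 22+2: (6400, 24)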
EC Expansion Balances Costs and Reliability
As the number of nodes increases during capacity expansion, the number of data blocks (N in EC N+M) automatically increases, keeping reliability unchanged while improving space utilization: 8+2 (80.00%) to 14+2 (87.50%) to 18+2 (90.00%); 12+3 (80.00%) to 18+3 (85.71%); 12+4 (75.00%) to 18+4 (81.82%).
[Figure: a stripe of 22 data blocks and 2 parity blocks; EC encoding and decoding are performed by the EDS process.]
Degraded read: EC can reconstruct and return valid data only after read degradation and verification decoding are performed. The reliability of EC-protected data also decreases while a node is faulty and must be recovered through data reconstruction.
EC reduction write: if a fault occurs, writing with EC reduction preserves write reliability; for example, if the original EC scheme is 4+2, data is written in EC 2+2 mode during the fault. Write operations are never degraded, providing higher reliability.
High-Ratio EC Aggregates Small Data Blocks, and ROW Balances Costs and Performance
[Figure: 4 KB and 8 KB I/Os from LUNs 0-2 are aggregated into linear space (A, B, C, D, E, ...) and written to a Plog as a full stripe with parities P and Q.]
Data is stored using Append Only Plog, the ROW-based append-write technology for I/O processing. Intelligent stripe aggregation plus log appending reduces latency and achieves a high EC ratio of 22+2:
• Host write I/Os are aggregated into full EC stripes based on the write-ahead log (WAL) mechanism for system reliability.
• The load-based intelligent EC algorithm writes data to SSDs in full stripes, reducing write amplification and keeping host write latency below 500 μs.
• A mirror relationship is created between data in the hash memory table and SSD media logs. After aggregation, random writes become 100% sequential writes to the back-end media, improving random-write efficiency.
Inline and Post-Process Adaptive Deduplication and Compression
OceanStor 100D supports global deduplication and compression as well as adaptive inline and post-process deduplication. Deduplication reduces disk write amplification before data is written to disks, and global adaptive deduplication and compression can be performed on all-flash SSDs and HDDs.
OceanStor 100D uses an opportunity-table plus fingerprint-table mechanism. After data enters the cache, it is broken into 8 KB fragments and the SHA-1 algorithm computes each fragment's fingerprint. The opportunity table filters out data with low deduplication ratios, reducing invalid fingerprint space: if inline deduplication fails or is skipped, post-process deduplication is enabled and new data-block fingerprints first enter the opportunity table; recurring fingerprints are promoted from the opportunity table to the fingerprint table, and data matching the fingerprint table is deduplicated directly. The fingerprint table occupies little memory, which supports deduplication in large-capacity systems. (A bookkeeping sketch follows after the table.)
Adaptive inline and post-process deduplication: when system resource usage reaches the threshold, inline deduplication automatically stops and data is written directly to disks for persistence; when system resources are idle, post-process deduplication starts.
Before data is written to disks, it enters the compression process. Compression output is aligned in units of 512 bytes; the LZ4 algorithm is used, and the deep-compression algorithm HZ9 is supported.

Media Type | Service Type | Deduplication and Compression | Impact on Performance
All-flash SSDs | Bandwidth-intensive | Deduplication and compression enabled | Reduced by 30%
All-flash SSDs | IOPS-intensive | Deduplication and compression enabled | Reduced by 15%
All-flash SSDs | Bandwidth-intensive | Compression enabled only | Pure write increased by 50%, pure read increased by 70%
All-flash SSDs | IOPS-intensive | Compression enabled only | Reduced by 10%
HDDs | Bandwidth-intensive | Compression enabled only | Pure write increased by 50%, pure read increased by 70%
HDDs | IOPS-intensive | Compression enabled only | None
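A hedged sketch of the opportunity-table/fingerprint-table flow for 8 KB fragments. Promotion on the second sighting is an assumption; the real promotion policy is not specified on the slide.

import hashlib

# Sketch of dedup bookkeeping: SHA-1 fingerprints of 8 KB fragments, an
# opportunity table for first sightings, a fingerprint table for known blocks.
opportunity, fingerprints = set(), {}

def write_fragment(frag, addr):
    fp = hashlib.sha1(frag).digest()
    if fp in fingerprints:
        return ("dedup-hit", fingerprints[fp])   # write a reference, not the data
    if fp in opportunity:                        # seen before: promote (assumed policy)
        opportunity.discard(fp)
        fingerprints[fp] = addr
    else:
        opportunity.add(fp)                      # first sighting: record the opportunity
    return ("stored", addr)

a = bytes(8192)
print(write_fragment(a, 100))   # ('stored', 100)  - enters the opportunity table
print(write_fragment(a, 200))   # ('stored', 200)  - promoted to the fingerprint table
print(write_fragment(a, 300))   # ('dedup-hit', 200)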
Post-Process Deduplication
Flow between the deduplication service and the data service on nodes 0 and 1:
1. Preparation
2. Injection: new fingerprints enter the opportunity table
3. Analysis
4. Raw data read
5. Promotion: recurring fingerprints move to the fingerprint data table
6. Remapping: the address mapping table is updated to reference the retained block
7. Garbage collection
Inline Deduplication Process
[Figure: on the write path, data fragments are fingerprinted; for blocks already in the fingerprint table, the fingerprint reference is written instead of the data (step 3: writing fingerprint instead of data).]
Compression Process
After deduplication, data is compressed; after compression, the data is compacted.
1. Breaking data into fixed lengths: I/O data is split into blocks of the same size.
2. Compression: compression can be enabled or disabled per LUN/FS. The supported algorithms are LZ4 and HZ9; HZ9 is a deep-compression algorithm whose compression ratio is about 20% higher than LZ4's.
3. Data compaction: compressed blocks have varying lengths. In a typical storage layout, each compressed block (B1-B5) is individually padded to 512-byte alignment (occupying, for example, 512-byte, 1.5 KB, 2.5 KB, 3.5 KB, 4.5 KB, and 5 KB boundaries), wasting considerable space. OceanStor 100D instead compacts multiple sub-512-byte blocks together behind a compression header that records the start and length of each compressed block, then stores the packed data aligned (for example, at 512 bytes, 1.5 KB, 2.5 KB, 3 KB). If the compacted data is not 512-byte aligned, zeros are padded. This improves space utilization (a compaction arithmetic sketch follows below).
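A sketch of the compaction arithmetic: pack variable-length compressed blocks back-to-back with a small header and pad only the tail to a 512-byte boundary. The header size is invented for the example.

# Sketch of 512-byte-aligned data compaction for compressed blocks.
HEADER = 8  # hypothetical per-block header recording start and length

def compacted_size(compressed_lengths):
    payload = sum(HEADER + n for n in compressed_lengths)
    return -(-payload // 512) * 512              # round the packed total up once

def padded_size(compressed_lengths):
    return sum(-(-n // 512) * 512 for n in compressed_lengths)  # pad every block

blocks = [300, 900, 1400, 200, 450]              # compressed sizes in bytes
print(padded_size(blocks), "->", compacted_size(blocks))  # 4096 -> 3584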
Ultra-Large Cluster Management and Scaling
Compute cluster: compute nodes running VBS can be added or removed. Up to 10,240 compute nodes are supported over TCP, 100 over IB, and 768 over RoCE.
Storage cluster: storage nodes are attached through network switches.
All-Scenario Online Non-Disruptive Upgrade (No Impact on Hosts)
Block mounting scenario:
• The VSC on the compute node provides only interfaces and forwards I/Os; its code is simplified and does not need to be upgraded.
• Each VBS node completes its upgrade within 5 seconds and services are quickly taken over. The host connection is uninterrupted, but I/Os are suspended for up to 5 seconds.
• Component-based upgrade: components without changes are not upgraded, minimizing the upgrade duration and the impact on services.
iSCSI scenario:
• Before the upgrade, a backup process is started and maintains the connection. The TCP connection, the iSCSI connection state, and in-flight I/Os are saved in shared memory; the upgrade completes within 5 seconds; the TCP and iSCSI connections and I/Os are then restored.
• Single-path upgrade is supported: host connections are not interrupted, but I/Os are suspended for up to 5 seconds.
Concurrent Upgrade of Massive Storage Nodes
[Figure: storage nodes grouped into multiple disk pools that are upgraded in parallel.]
Upgrade description:
• A storage pool is divided into multiple disk pools, and the disk pools are upgraded concurrently, greatly shortening the upgrade duration of massive distributed storage nodes.
• The customer can periodically query and update versions from the FSM node based on the configured policy.
• Compute nodes can be upgraded from a management node by running upgrade commands.
Coexistence of Multi-Generation and Multi-Platform Storage Nodes
• Multi-generation storage nodes (dedicated storage nodes of three consecutive generations and general-purpose storage nodes of two consecutive generations) can exist in the same cluster but in different pools.
• Storage nodes on different platforms (for example, P100 and C100) can exist in the same cluster and pool.
CPU Multi-Core Intelligent Scheduling
Traditional CPU thread scheduling:
• No CPU grouping; frequent switching of processes among CPU cores increases latency.
• Per-thread CPU scheduling is disordered, decreasing CPU efficiency and peak performance.
• Tasks with different priorities (foreground and background I/O tasks, interrupts and service tasks) conflict with one another, destabilizing performance.
Group- and priority-based multi-core intelligent scheduling (on high-performance Kunpeng multi-core CPUs): cores are partitioned into groups (Group 1 to Group N), each hosting dedicated thread pools: front-end and communication thread pools at priority 1, the I/O thread pool at priority 2, back-end and metadata-merge pools, and deduplication (priority 4) and replication (priority 5) pools with background adaptive speed adjustment. Front-end I/Os are processed and returned fast.
Advantages:
• Elastic compute-power balancing efficiently adapts to complex and diversified service scenarios.
• Thread-pool isolation and intelligent CPU grouping reduce switching overhead and provide stable latency.
Distributed Multi-Level Cache Mechanism
Multi-level distributed read cache:
• About 1 µs: semantic-level RAM cache (EDS metadata cache); a hit returns immediately.
• About 10 µs: SCM pool with an intelligent I/O-model recognition engine (SCM RoCache planned); a miss reads from the pool, SSD, or HDD.
• About 100 µs: semantic-level SSD smart cache and disk-level SSD cache for disk reads.
Multi-level distributed write cache:
• Writes go to the WAL log and the BBU-protected RAM write cache and respond quickly.
• Data is then aggregated and flushed to the SSD write cache, and later aggregated and flushed to HDDs.
EC Intelligent Aggregation Algorithm
[Figure: traditional cross-node EC versus intelligent aggregation EC based on append write, for LUN 1 and LUN 2 across nodes 1-6.]
The Append Only Plog technology provides the optimal disk-flushing performance model for the media:
• A Plog is a set of physical addresses managed at a fixed size; the upper layer accesses a Plog by Plog ID and offset.
• A Plog is append-only and cannot be overwritten. Logical-address overwrites (A to A', B to B') become new appends in the cache linear space, and new Plogs are written in appending mode to the physical address space (Plog 1, Plog 2, Plog 3, ...).
Performance advantages (a minimal Plog sketch follows below):
• Provides a media-friendly, large-block sequential-write model with optimal performance.
• Reduces SSD global garbage-collection pressure through sequential appending.
• Provides the basis for the EC intelligent aggregation algorithm.
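A minimal append-only Plog sketch; the sizes and interface are illustrative, not the product's internal API.

# Sketch of an append-only Plog: writes append, reads use (plog_id, offset, length).
class Plog:
    def __init__(self, plog_id, capacity=4 * 1024 * 1024):   # e.g. a 4 MB Plog
        self.plog_id, self.capacity, self.buf = plog_id, capacity, bytearray()

    def append(self, data):
        if len(self.buf) + len(data) > self.capacity:
            raise IOError("Plog sealed: allocate a new Plog")  # no overwrite, ever
        offset = len(self.buf)
        self.buf += data
        return self.plog_id, offset                            # address for later reads

    def read(self, offset, length):
        return bytes(self.buf[offset:offset + length])

p = Plog(plog_id=1)
addr = p.append(b"A" * 8192)            # a logical overwrite of A becomes a new append
addr2 = p.append(b"A2" + b"\0" * 8190)
print(addr, addr2)                      # -> (1, 0) (1, 8192)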
RoCE+AI Fabric Network Provides Ultra-Low Latency
Block: DHT Algorithm + EC Aggregation Ensure Balancing and Ultimate Performance
Service layer (64 MB granularity): the storage node receives I/O data and distributes 64 MB data blocks across storage nodes using the DHT algorithm, partition = (LBA / 64 MB) % (number of partitions). After receiving the data, the storage system divides it into fixed-length 8 KB blocks (grains), deduplicates and compresses at 8 KB granularity, and aggregates the data before storing it to the storage pool (a routing sketch follows below).
Index layer: maintains the mapping between LUN LBAs and grains (for example, LUN1-LBA1 to Grain1, LUN1-LBA2 to Grain2, LUN1-LBA3 to Grain3, LUN2-LBA4 to Grain4).
Persistency layer: grains form EC stripes (for example, four grains per stripe, D1-D4 plus parities P1-P2) that are stored in the partition identified by the partition ID across nodes 1-7.
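The routing formula above is concrete enough to sketch. The partition-to-node mapping is simplified to a modulo here, which is an assumption, not the product's placement logic.

# Sketch of the block service's two-level routing: DHT over 64 MB extents,
# then 8 KB grains aggregated into EC stripes.
EXTENT = 64 * 1024 * 1024
GRAIN = 8 * 1024

def route(lba_bytes, num_partitions, nodes):
    partition = (lba_bytes // EXTENT) % num_partitions   # DHT: (LBA / 64 MB) % partitions
    node = nodes[partition % len(nodes)]                 # simplified partition->node map
    grain = (lba_bytes % EXTENT) // GRAIN                # 8 KB grain within the extent
    return partition, node, grain

nodes = [f"Node-{i}" for i in range(1, 8)]
print(route(200 * 1024 * 1024, num_partitions=4096, nodes=nodes))
# -> (3, 'Node-4', 1024): extent 3, grain 1024 within that extent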
Object: Range Partitioning and WAL Submission Improve Performance and Support Hundreds of Billions of Objects
Range partitioning: the object namespace (A, AA, AB, ABA, ABB, ..., ZZZ) is split into contiguous ranges (Range 0 to Range n), each owned by a node (Node 1 to Node n), so lookups touch only the owning node (a lookup sketch follows below). WAL submission persists updates through a write-ahead log before they are applied.
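A minimal sketch of range partitioning over object names using binary search; the range boundaries here are invented for the example.

import bisect

# Sketch: object names are routed to the node owning the enclosing name range.
BOUNDARIES = ["F", "M", "T"]                 # splits the namespace into 4 ranges
NODES = ["Node 1", "Node 2", "Node 3", "Node 4"]

def owner(object_name):
    return NODES[bisect.bisect_right(BOUNDARIES, object_name)]

for name in ["ABA", "HELLO", "ZZZ"]:
    print(name, "->", owner(name))           # ABA -> Node 1, HELLO -> Node 2, ZZZ -> Node 4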
HDFS: Concurrent Multiple NameNodes Improve Metadata Performance
Traditional HDFS NameNode model (HA based on NFS or Quorum Journal):
• Only one active NameNode provides the metadata service. The active and standby NameNodes are not consistent in real time and have a synchronization period.
• If the active NameNode fails, the new NameNode cannot provide metadata service until it finishes loading logs, which can take up to several hours.
• The number of files supported by a single active NameNode depends on one node's memory; the maximum is about 100 million files.
• When a namespace is under heavy pressure, concurrent metadata operations consume large amounts of CPU and memory, degrading performance.
Huawei HDFS concurrent multi-NameNode:
• Multiple active NameNodes provide metadata services with real-time data consistency across nodes.
• Avoids the metadata-service interruption caused by traditional HDFS NameNode switchover.
• The number of files supported is no longer limited by a single node's memory.
• Multi-directory metadata operations are performed concurrently on multiple nodes.
NAS: E2E Performance Optimization
Private client, distributed cache, and large-I/O passthrough (DIO) technologies enable the storage system to provide high bandwidth and high OPS.

Key Technology | Large Files, High Bandwidth | Large Files, Random Small I/Os | Massive Small Files, High OPS | Peer Vendors
1. Private client (POSIX, MPI-IO) | Multi-channel protocol: 35% higher bandwidth than common NAS protocols | Intelligent protocol load balancing: 45% higher IOPS than common NAS protocols | Small I/O aggregation: 40% higher IOPS than common NAS protocols | Not supported by Isilon or NetApp; supported by DDN and IBM
4. Self-adaptive application block size | 1 MB block size, improving read and write performance | 8 KB block size, improving read and write performance | 8 KB block size, improving read and write performance | Only Huawei supports this
5. Large I/O passthrough | Large I/Os are read from and written to the persistence layer directly, improving bandwidth | / | / | Only Huawei supports this
OceanStor 100D (Block+File+Object+HDFS): Overview | Ultimate Performance | Ultimate Usability | Scenario and Ecosystem
Component-Level Reliability: Component Introduction and Production Improve Hardware Reliability
Disk selection flow: design review → authentication and preliminary tests on R&D samples, with analysis of product applications → qualification test (500+ test cases: system functions and performance, compatibility with earlier versions, and disk reliability) → system test → joint test → ERT long-term reliability test → failure analysis with circular improvements (about 1000 disks have gone through ERT testing) → supplier tests and audits during production (quality/supply/cooperation).
Fault-locating levels: I. System/disk logs: locate simple system problems quickly. II. Protocol analysis: locate interaction problems accurately. III. Electrical signal: 5-level FA analysis.
Based on advanced test algorithms and industry-leading full-temperature-cycle tests, firmware tests, ERT tests, and ORT tests, Huawei distributed storage ensures that component defects and firmware bugs are effectively intercepted, keeping the overall hardware failure rate 30% lower than the industry level.
Product-Level Reliability: Data Reconstruction Principles
[Figure: when one disk fails, its data is reconstructed in parallel from partitions spread across all remaining disks (Disk 1 to Disk N), rather than onto a single spare.]
Product-Level Reliability: Technical Principles of E2E Data Verification
[Figure: APP → Block/NAS/S3/HDFS front end → switch → EDS → OSD → disk. 1: WP DIA insertion; 2: WP DIA verification; 3: RP DIA verification; 5→4: read repair from the remote HyperMetro site; 6→4: read repair from another copy or EC check calculation. Each 4 KB data block carries a 64-byte data integrity area over its 512-byte sectors.]
Two verification modes:
• Real-time verification: write requests are verified at the system access point (the VBS process). Host data is re-verified by the OSD process before being written to disks, and data read by the host is verified by the VBS process.
• Periodic background verification: when system service pressure is low, the system automatically performs periodic background verification and self-healing.
Three verification mechanisms: CRC-32 protects users' 4 KB data blocks. In addition, OceanStor 100D supports host and disk LBA logical verification to cover silent data corruption scenarios such as misplacement, read offset, and write offset.
Two self-healing mechanisms: the local redundancy mechanism and the active-active redundant data mechanism. Block, NAS, and HDFS support E2E verification; Object does not.
Product-Level Reliability: Technical Principles of Subhealth Management
1. Disk subhealth management
• Intelligent detection and diagnosis: SMART (Self-Monitoring, Analysis and Reporting Technology) information, statistical I/O latency, real-time I/O latency, and I/O errors are collected; clustering and slow-disk detection algorithms diagnose abnormal disks or RAID controller cards.
• Isolation and warning: after diagnosis, the MDC is instructed to isolate the involved disks and an alarm is reported.
3. Process/service subhealth management
• Cross-process/service detection (1): if I/O access latency exceeds the specified threshold, an exception is reported.
• Smart diagnosis (2): OceanStor 100D diagnoses processes or services with abnormal latency using majority voting or clustering algorithms over the abnormal I/O latencies reported by each process or service.
• Isolation and warning (3): abnormal processes or services are reported to the MDC for isolation, and an alarm is raised.
Node fault handling
[Figure: key information is written to ZK in advance. Upon an unexpected power-off, a power-off via the power button, or a system-insensitive reset due to hardware faults, the 1822 NIC notifies the MDC/CM, the node fault is broadcast, troubleshooting proceeds, and the node reboots and rejoins.]
Solution-Level Reliability: Gateway-Free Active-Active Design Achieves 99.9999% Reliability (Block)
Key Technologies of Active-Active Consistency Assurance Design
In normal cases, write I/Os are written to both sites concurrently before success is returned to the hosts, ensuring data consistency.
An optimistic lock mechanism is used. When the LBA ranges of I/Os at the two sites do not conflict, the I/Os are written at their own sites. If the two sites write overlapping LBA ranges, the data is forwarded to one site for serialized writes to resolve the lock conflict, ensuring data consistency between the two sites (a conflict-check sketch follows below).
[Figure: data centers A and B form a cross-site active-active cluster over a HyperMetro LUN. 1: the host sends I/O1. 2: dual-write is performed for I/O1 and a local lock is taken on the I/O1 range at both ends. 3: the host sends I/O2. 4: dual-write is performed for I/O2. 5: a conflict is detected and I/O2 is forwarded for serial processing.]
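A hedged sketch of the optimistic range-lock conflict check between the two sites; the forwarding policy is simplified to a single message.

# Sketch of optimistic locking for cross-site dual-write: non-overlapping LBA
# ranges proceed in parallel; overlapping writes are serialized at one site.
def overlaps(r1, r2):
    (s1, e1), (s2, e2) = r1, r2
    return s1 < e2 and s2 < e1

def admit(write_range, in_flight):
    if any(overlaps(write_range, held) for held in in_flight):
        return "forward to lock-owner site for serial write"   # conflict resolution
    in_flight.append(write_range)
    return "dual-write in parallel at both sites"

in_flight = []
print(admit((0, 1024), in_flight))      # I/O1: no conflict -> parallel dual-write
print(admit((512, 2048), in_flight))    # I/O2: overlaps I/O1 -> forwarded, serialized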
Cross-Site Bad Block Recovery
Data Synchronization and Difference Recording Mechanisms
Summary: Second-level RPO and asynchronous replication without differential logs (ROW mechanism) are
supported, helping customers recover services more quickly and efficiently.
OceanStor 100D (Block+File+Object+HDFS): Overview | Ultimate Performance | Ultimate Usability | Scenario and Ecosystem
Out-of-the-Box Operation
Install the hardware.
AI-based Forecast
1. eService trains AI algorithms online using massive data; mature algorithms are then adapted to devices.
2. AI algorithms that do not require large-volume training are self-learned on the device.
Example result: risky-disk forecast. AI algorithms run on the device against collected disk information and capacity data to forecast disk risks half a year in advance.
Hardware Visualization
Visualized hardware: hardware-based modeling and layered display of global hardware enable second-level hardware fault locating.
Network Visualization
Visualized networks: based on the network information model, collaborative network management supports network planning, fault-impact analysis, and fault locating.
Three-Layer Intelligent O&M
Data center layer: OceanStor DJ (storage resource control: resource provisioning and service planning), SmartKit (storage service tool: delivery, upgrade, troubleshooting, log analysis, and inspection), and eSight (storage monitoring and management: fault monitoring, performance reports, and storage subnet topology).
eService cloud layer: intelligent maintenance and intelligent analysis platforms, providing correlation analysis and forecasting.
OceanStor 100D (Block+File+Object+HDFS): Overview | Ultimate Performance | Ultimate Usability | Scenario and Ecosystem
Flexible and General-Purpose Storage Nodes
New-Generation Distributed Storage Hardware
High-density HDD (Pacific): 5 U, 2 nodes, 60 HDDs per node (120 HDDs in total); two I/O cards per node and four onboard 25GE ports.
High-density all-flash (Atlantic): 5 U, 8 nodes, 10 SSDs per node (80 SSDs in total); two I/O cards per node (16 I/O cards in total); two switch boards supporting eight 100GE ports and eight 25GE ports.
Pacific Architecture Concept: High Disk Density, Ultimate TCO, Dual-Controller Switchover, High Reliability, Smooth Upgrade, and Ever-New Data
• Two nodes with 120 disks provide 24 disk slots per U, industry-leading density with the industry's lowest TCO per disk slot.
• The vnode dual-controller switchover design eliminates data reconstruction on controller failure, ensuring service continuity and high reliability.
• FRU design for all system components allows each subsystem to evolve smoothly and independently: one system, ever-new data.
[Figure: two controllers, each with PCIe cards, 4 x 25GE and 1 x GE ports, fan modules, power, system disks, and BBUs, connected over SAS/PCIe expanders to eight HALFPALM x15 disk groups.]
Pacific VNode Design: High-Ratio EC, Low Cost, Dual-Controller
Switchover, and High Reliability
• Vnode design supports large-scale EC and provides over 20% cost competitiveness.
• If a controller is faulty, the peer controller of the vnode takes over services in seconds, ensuring service continuity and improving reliability and availability.
• EC redundancy is constructed based on vnodes. Areas affected by faults can be reduced to fewer than 15 disks. A faulty expansion board can be replaced online.
Secondary controller service takeover in the case of a faulty controller Vnode-level high-ratio EC N+2
Enclosure 1 Enclosure 2
X
Compute virtual Compute virtual Compute virtual Compute virtual
unit unit unit unit
1620 1620
VNOD VNOD VNOD VNOD VNOD VNOD VNOD VNOD VNOD VNOD VNOD VNOD VNOD VNOD VNOD VNOD
E E E E E E E E E E E E E E E E
SAS/PCIE
Single-controller SSD Single-controller SSD
VNODE 1 VNODE 1 VNODE 1 VNODE 1 VNODE 1 VNODE 1 VNODE 1 VNODE 1
SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD SSD
HALFPALM
HALFPALM
HALFPALM
HALFPALM
HALFPALM
HALFPALM
HALFPALM
HALFPALM
15 x HDDs 15 x HDDs
X15 X15 X15 X15 X15 X15 X15 X15
Dual-controller takeover
1. When a single node is faulty, its SSDs are physically taken over by the secondary controller within 10 seconds, and no reconstruction is required. Compared with common hardware, fault recovery is reduced from days to seconds.
2. The system software can be upgraded online or offline without service interruption.
3. Controllers and modules can be replaced online.
Vnode-level high-ratio EC
1. The SSDs of a single controller are distributed to four vnodes based on the expander unit, supporting large-ratio EC at the vnode level.
2. With at least two enclosures and 16 vnodes configured, the EC ratio can reach 30+2 while matching the reliability of general-purpose hardware.
3. Expansion modules (including power supplies) and disk modules can be replaced online independently.
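To make the cost figure concrete, below is a minimal Python sketch of the usable-capacity arithmetic behind large-ratio EC. The 30+2 ratio comes from this slide; the 4+2 comparison scheme is an assumed small-ratio layout used only for illustration.

```python
# Minimal sketch: usable-capacity efficiency of N+M erasure-coding schemes.

def ec_efficiency(data_strips: int, parity_strips: int) -> float:
    """Fraction of raw capacity holding user data under N+M erasure coding."""
    return data_strips / (data_strips + parity_strips)

small = ec_efficiency(4, 2)    # assumed small-ratio scheme: ~66.7%
large = ec_efficiency(30, 2)   # vnode-level scheme from this slide: ~93.8%

print(f"4+2  efficiency: {small:.1%}")
print(f"30+2 efficiency: {large:.1%}")
# Raw capacity needed scales as 1/efficiency, so 30+2 needs ~29% less raw
# disk than 4+2 for the same usable capacity, consistent with the slide's
# 'over 20% cost advantage' for large-ratio EC.
print(f"raw-capacity saving: {1 - small / large:.1%}")
```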
Atlantic Architecture: Flash-Native Design with Ultimate Performance, Built-in Switch Networking, Smooth Upgrade, and Ever-New Data
• Flash-native design: fully utilizes the performance and density advantages of SSDs. Bandwidth reaches 144 Gbit/s and up to 96 SSDs are supported, ranking No. 1 in the industry.
• Simplified network design: the built-in 100 Gbit/s IP fabric supports switch-free Atlantic networking and hierarchical Pacific networking, simplifying network planning.
• FRU design for all system components and independent, smooth evolution of each subsystem deliver one entire system for ever-new data.
Native Storage Design of Atlantic Architecture
Native all-flash design:
• Matched compute power and SSDs: each Kunpeng processor connects to a small number of SSDs to fully utilize SSD performance and avoid CPU bottlenecks.
• Flash-native form factor: the half-palm SSD provides 50% higher slot density than a 2.5-inch disk and more efficient heat dissipation.
Traditional architecture design
1. The ratio of CPUs to flash storage is unbalanced, causing insufficient utilization of the flash storage.
2. A 2 U device can house a maximum of 25 SSDs, making it difficult to improve medium density.
3. The holes in the vertical backplane are small, requiring more space for CPU heat dissipation.
4. 2.5-inch NVMe SSDs evolved from HDDs and are not purpose-built for flash.
Atlantic architecture, designed for high-performance, high-density application scenarios
1. The parallel backplane design improves backplane porosity by 50% and reduces power consumption by 15%; compute density and performance improve by 50%.
2. The half-palm SSD design reduces thickness, increases depth, and improves SSD density by over 50%.
3. Optimal ratio of CPUs to SSDs: a single enclosure supports eight nodes, improving storage density and compute power by 50%.
Atlantic Network Design: Simplified Networking Without Performance Bottleneck
Switch-free design:
• The 24-node switch-free design eliminates network bandwidth bottlenecks.
• With the switch-free design, switch boards are interconnected through 8 x 100GE ports.
• The passthrough mode supports switch interconnection. Switch board ports support a 1:1, 1:2, or 1:4 bandwidth convergence ratio.
[Diagram: switch-free networking: SD6603 switch boards interconnected over 100GE (QSFP-DD) and 4 x 50GE internal links, with 4 nodes at 100GE per node per enclosure; 4 x 25GE links attach Pacific enclosures for tiered storage; 4 x 100GE uplinks connect to a 64-port 100GE switch]
OceanStor 100D Block+File+Object+HDFS: Overview | Ultimate Performance | Ultimate Usability | Scenario and Ecosystem
Block Service Solution: Integration with the VMware Ecosystem
Block Service Solution: Integration with OpenStack
OceanStor 100D Cinder Driver architecture (OpenStack P release):
Cinder API → Cinder Scheduler → Cinder Volume → OceanStor 100D volume driver package (STaaS or eSDK) → OceanStor 100D

Cinder Driver API (must-have core functions):
OpenStack volume driver API | Supported by OceanStor 100D
create_volume               | Yes
delete_volume               | Yes
create_snapshot             | Yes
delete_snapshot             | Yes
get_volume_stats            | Yes
A driver sketch follows this table.
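For orientation, here is a minimal, standalone sketch of a Cinder-style volume driver exposing the five core functions in the table above. The method names follow the upstream Cinder volume driver interface; the OceanStor100DClient helper and its in-memory behavior are hypothetical placeholders, not Huawei's actual driver code (a real driver would subclass cinder.volume.driver.VolumeDriver).

```python
# Minimal sketch of a Cinder-style volume driver (standalone illustration).

class OceanStor100DClient:
    """Hypothetical management-API client for the storage pool."""
    def __init__(self):
        self.volumes, self.snapshots = {}, {}

    def create_volume(self, name, size_gb):
        self.volumes[name] = {"size_gb": size_gb}

    def delete_volume(self, name):
        self.volumes.pop(name, None)

    def create_snapshot(self, vol_name, snap_name):
        self.snapshots[snap_name] = {"source": vol_name}

    def delete_snapshot(self, snap_name):
        self.snapshots.pop(snap_name, None)


class OceanStor100DDriver:
    """Implements the five 'must core-function' APIs from the table above."""
    def __init__(self):
        self.client = OceanStor100DClient()

    def create_volume(self, volume):
        self.client.create_volume(volume["name"], volume["size"])

    def delete_volume(self, volume):
        self.client.delete_volume(volume["name"])

    def create_snapshot(self, snapshot):
        self.client.create_snapshot(snapshot["volume_name"], snapshot["name"])

    def delete_snapshot(self, snapshot):
        self.client.delete_snapshot(snapshot["name"])

    def get_volume_stats(self, refresh=False):
        # The Cinder scheduler uses these stats to place new volumes;
        # the capacity numbers here are illustrative.
        return {"volume_backend_name": "OceanStor100D",
                "total_capacity_gb": 1024, "free_capacity_gb": 512}


driver = OceanStor100DDriver()
driver.create_volume({"name": "vol1", "size": 10})
print(driver.get_volume_stats())
```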
Block Service Solution: Integration with Kubernetes
[Diagram: Kubernetes Master and Nodes running the CSI plug-in and its sidecars, connected to OceanStor 100D over the management plane (volume control) and the data plane (SCSI/iSCSI block device mounted at /mnt)]
Process for Kubernetes to use the OceanStor 100D CSI plug-in to provide volumes:
① The Kubernetes Master instructs the CSI plug-in to create a volume. The CSI plug-in invokes an OceanStor 100D interface to create the volume.
② The Kubernetes Master instructs the CSI plug-in to map the volume to the specified node. The CSI plug-in invokes the OceanStor 100D interface to map the volume to the specified node host.
③ The target Kubernetes node to which the volume is mapped instructs the CSI plug-in to mount the volume. The CSI plug-in formats the volume and mounts it to the specified directory on the node.
The CSI plug-in assistance services provided by the Kubernetes community (driver-register, external-provisioning, external-attaching) are deployed on all Kubernetes nodes based on the CSI specifications to complete volume creation/deletion, mapping/unmapping, and mounting/unmounting. The three-step flow is sketched in code after this slide.
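The three steps can be read as the standard CSI call sequence. Below is a toy Python walk-through; the method names CreateVolume, ControllerPublishVolume, and NodePublishVolume come from the CSI specification, while the storage-side behavior is an illustrative stub, not the OceanStor 100D plug-in itself.

```python
# Toy walk-through of the three-step CSI flow described above.

class ToyCSIPlugin:
    def CreateVolume(self, name, size_gb):
        # Step 1: driven by the external-provisioning sidecar;
        # the plug-in calls the array to create the volume.
        print(f"array: create volume {name} ({size_gb} GiB)")
        return {"volume_id": name}

    def ControllerPublishVolume(self, volume_id, node_id):
        # Step 2: driven by the external-attaching sidecar;
        # the plug-in maps the volume to the node host (e.g., iSCSI).
        print(f"array: map {volume_id} -> node {node_id}")

    def NodePublishVolume(self, volume_id, target_path):
        # Step 3: invoked on the target node; the plug-in formats the
        # block device and mounts it at the requested path.
        print(f"node: mkfs + mount {volume_id} at {target_path}")


plugin = ToyCSIPlugin()
vol = plugin.CreateVolume("pvc-1234", 20)                   # step 1
plugin.ControllerPublishVolume(vol["volume_id"], "node-a")  # step 2
plugin.NodePublishVolume(vol["volume_id"], "/mnt")          # step 3
```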
NAS HPC Solution: Parallel High Performance, Storage and Compute Collaborative Scheduling, High-Density Hardware, and Data Tiering
[Diagram: HPC application compute farm (50,000+ nodes) with serial and parallel I/O libraries (POSIX, MPI-IO), a login & job scheduler, and the parallel file system client, connected over the front-end network to OceanStor 100D File racks with SSD and HDD pools on a back-end network]
Parallel high-performance client
• Compatible with POSIX and MPI-IO: meets the requirements of the high-performance compute application ecosystem and provides an MPI optimization library contributed to the parallel compute community (see the I/O sketch after this slide).
• Parallel I/O for high performance: 10+ Gbit/s for a single client and 4+ Gbit/s for a single thread.
• Intelligent prefetch and local cache: supports cross-file prefetch and local caching, meeting the high-performance requirements of NDP.
• Large-scale networking: meets the requirements of over 10,000 compute nodes.
Collaborative scheduling of storage and compute resources
• Huawei scheduler: coordinates service scheduling with data storage, pre-loading data to SSDs and the local cache.
• QoS load awareness: I/O monitoring and load distribution avoid scheduling on compute capability alone and improve overall compute efficiency.
High-density customized hardware
• All-flash Atlantic architecture: 5 U, 8 controllers, 80 disks; 70 Gbit/s per enclosure at 1.6 million IOPS.
• High-density Pacific architecture: 5 U, 120 disks, supporting online maintenance and providing high-ratio EC to ensure a high disk utilization rate.
• Back-end network free: with the built-in switch module, the back-end network does not need to be deployed independently when there are fewer than 3 enclosures.
Data tiering for cost reduction
• Single view over multiple media pools: the service view does not change, so applications are unaffected by tiering.
• Traditional burst buffer free: the all-flash optimized file system simplifies deployment of a single namespace and provides higher performance, especially for metadata access.
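As an illustration of the MPI-IO access pattern such a parallel client must serve, here is a minimal mpi4py sketch in which every rank writes a disjoint block of one shared file concurrently. It assumes mpi4py and an MPI runtime are installed; the file path is illustrative.

```python
# Minimal sketch: parallel MPI-IO write to a single shared file.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank writes a fixed-size block at a disjoint offset, so all
# processes write the same file concurrently without locking.
block = f"rank {rank:04d} data\n".encode().ljust(64)
fh = MPI.File.Open(comm, "/mnt/pfs/shared.dat",
                   MPI.MODE_WRONLY | MPI.MODE_CREATE)
fh.Write_at(rank * len(block), block)
fh.Close()

# Run with, e.g.:  mpirun -n 8 python mpiio_write.py
```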
HDFS Service Solution: Native Semantics, Separation of Storage and Compute, Centralized Data Deployment, and On-Demand Expansion
[Diagram, left: coupled deployment with FusionInsight, HORTONWORKS, HBase, Hive, and Cloudera on management and compute/storage nodes, each combining CPU, memory, and a local storage system. Right: decoupled deployment with the same compute stack on compute nodes, using native HDFS semantics to reach a distributed storage cluster]
• Coupled: based on general-purpose x86 servers and Hadoop software, compute/storage nodes access the local HDFS, so compute and storage resources can only be expanded concurrently.
• Decoupled: OceanStor 100D HDFS replaces the local HDFS of Hadoop. Using native semantics for interconnection, storage and compute resources are decoupled and can be expanded independently, facilitating on-demand expansion of both. A connection sketch follows this slide.
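Because the storage pool speaks native HDFS semantics, compute-side code can address it like any HDFS endpoint. Below is a minimal sketch using pyarrow's HDFS binding; the endpoint name and path are illustrative assumptions, and libhdfs must be available on the client.

```python
# Minimal sketch: pointing HDFS client code at an external,
# HDFS-semantics storage cluster instead of a local namenode.
from pyarrow import fs

# Coupled deployment would use the local namenode; for the decoupled
# deployment, only the endpoint changes (hostname is illustrative).
hdfs = fs.HadoopFileSystem("storage-cluster-endpoint", 8020)

with hdfs.open_output_stream("/warehouse/events/part-0000.csv") as f:
    f.write(b"event_id,ts\n1,1700000000\n")

print(hdfs.get_file_info("/warehouse/events/part-0000.csv").size)
```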
Object Service Solution: Online Aggregation of Massive Small Objects Improves Performance and Capacity
When a stored object is smaller than the system stripe, a large number of space fragments are generated, greatly reducing space usage and access efficiency.
[Diagram: objects Obj1..Obj6 are aggregated and written as an incremental EC stripe (EC 4+2 with 512 KB strips is used as an example: Strip1-4 plus Parity 1-2), staged through the SSD cache on each node]
Object data aggregation
• Incremental EC aggregates small objects into large objects without performance loss.
• Reduces storage space fragmentation in massive small-file scenarios such as government big data storage, carrier log retention, and bill/medical image archiving.
• Improves the space utilization of small objects from 33% (three copies) to over 80% (12+3 EC); the arithmetic is sketched after this slide.
• SSD cache is used for object aggregation, improving single-node performance by six times (PUT: 3,000 TPS per node).
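The quoted utilization figures follow directly from the redundancy arithmetic; a minimal sketch:

```python
# Minimal sketch of the space-utilization arithmetic quoted above:
# replication vs. erasure coding for small-object storage.

def replication_utilization(copies: int) -> float:
    return 1 / copies

def ec_utilization(data: int, parity: int) -> float:
    return data / (data + parity)

print(f"3 copies : {replication_utilization(3):.0%}")  # ~33%
print(f"EC 12+3  : {ec_utilization(12, 3):.0%}")       # 80%
print(f"EC 4+2   : {ec_utilization(4, 2):.0%}")        # ~67% (slide example)
```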
Unstructured Data Convergence Implements Multi-Service Interworking
Converged storage pool with unified O&M and management: the NAS, HDFS, and S3 services share the same storage pool, reducing initial deployment costs.
• NAS scenarios: gene sequencing, oil survey, EDA, satellite remote sensing, IoV
• HDFS scenarios: log retention, operation analysis, offline analysis, real-time search
• S3 scenarios: backup and archival, resource pool, PACS, check image
Thank you.
Bring digital to every person, home, and organization for a fully connected, intelligent world.