
High Scale OLTP

Lessons Learned from SQLCAT Performance Labs

Ewan Fairweather
Program Manager, Microsoft

Session Objectives and Takeaways


Session Objectives:
Learn about SQL Server capabilities and challenges experienced by some of our
extreme OLTP customer scenarios.
Insight into diagnosing and architecting around issues with Tier-1, mission
critical workloads.

Key Takeaways
SQL Server can meet the needs of many of the most challenging OLTP scenarios
in the world.
There are a number of new challenges when designing for high end OLTP
systems.

Laying the foundation and tuning for OLTP

Laying the foundation and tuning for OLTP workloads:


Understand goals and attributes of workload
Performance requirements
Machine born data vs. User driven solution
Read-Write ratio
HA/DR requirements which may have an impact

Apply Configuration and Best Practices guidance


Database and data file considerations
Transaction Log sizing and placement
Configuring the SQL Server Tempdb Database
Optimizing memory configuration

Be familiar with common performance methodologies, toolsets and common OLTP / scaling performance pain points
Know your environment: understanding the hardware is key

Database Files
# should be at least 25% of CPU cores
This alleviates PFS contention (PAGELATCH_UP)
There is no significant point of diminishing returns up to 100% of CPU cores
But manageability is an issue...
Though Windows 2008R2 makes this much easier

TempDb
PFS contention is a larger problem here as it's an instance-wide resource
Deallocations and allocations, RCSI version store, triggers, temp tables
# files should be exactly 100% of CPU threads
Presize at 2 x physical memory

Data files and TempDb on same LUNs

It's all random anyway; don't sub-optimize
IOPS is a global resource for the machine. Goal is to avoid PAGEIOLATCH on any data file

Key Takeaway: Script it! At this scale, manual work WILL drive you
insane
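The takeaway above can be put into practice with a short script; a minimal sketch, assuming an 8-thread box and a hypothetical T: drive for tempdb (file names, sizes, and the path are illustrative assumptions, not the deck's own script):

```sql
-- Add one tempdb data file per CPU thread (8 here), presized so autogrow
-- never fires during the workload.
ALTER DATABASE tempdb
MODIFY FILE (NAME = tempdev, SIZE = 8GB, FILEGROWTH = 0);

ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2, FILENAME = 'T:\tempdb2.ndf', SIZE = 8GB, FILEGROWTH = 0);
-- ...repeat for tempdev3 .. tempdev8 so the file count matches CPU threads.
```

At this scale, generating these statements from a loop or template keeps the file layout consistent across instances.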

Special Consideration: Transaction Log


Transaction log is a set of 127 linked buffers with max 32 outstanding IOs
Each buffer is 60KB
Multiple transactions can fit in one buffer
BUT: buffer must flush before log manager can signal a commit OK

Pre-allocate log file


Use DBCC LOGINFO for existing systems
Example: transaction log throughput was ~80MB/sec
But we consistently got <1ms latency, no spikes!
Initial setup: 2 x HBA on dedicated storage port on RAID10 with 4+4
When tuning for peak: SSD on internal PCI bus (latency: a few µs)

Key Takeaway: For transaction log, dedicate storage components and optimize for low latency
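Pre-allocating the log as recommended above avoids autogrow-driven VLF sprawl; a minimal sketch (the database and file names are illustrative assumptions):

```sql
-- Pre-size the log in one operation rather than relying on many small
-- autogrows, which would create a large number of VLFs.
ALTER DATABASE MyOltpDb
MODIFY FILE (NAME = MyOltpDb_log, SIZE = 64GB, FILEGROWTH = 4GB);

-- Inspect the resulting VLF layout on an existing system.
DBCC LOGINFO ('MyOltpDb');
```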

SQL Server Memory Setup


For large CPU/memory boxes, Lock Pages in Memory really matters
We saw more than double performance
Use gpedit.msc to grant it to the SQL Service account

Consider TF834 (Large page Allocations)


On Windows 2008R2 previous issues with this TF are fixed
Around 5-10% throughput increase
Increases startup time
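TF834 is a startup-only trace flag; a minimal sketch of verifying it after a restart (it cannot be turned on at runtime with DBCC TRACEON):

```sql
-- Add -T834 to the SQL Server startup parameters, restart the service,
-- then confirm it is active:
DBCC TRACESTATUS (834, -1);
-- A Status of 1 for TraceFlag 834 confirms large page allocations are in use.
```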

Beware of NUMA node memory distribution


Set max memory close to box max if dedicated box available

How we think about tuning


Let the workload access patterns guide you
Observe resource consumption and resource waits
http://sqlcat.com/whitepapers/archive/2007/11/19/sql-server-2005-waits-and-queues.aspx
http://sqlcat.com/whitepapers/archive/2009/04/14/troubleshooting-performance-problems-in-sql-server-2008.aspx

Standard tuning always applies (indexes, TSQL, etc)


On these systems we always watch for concurrency related
bottlenecks and key components which affect throughput
Locking, latching, spinlocks, log latency, etc.

*Focus of tuning depends on the workload; foundation areas can bubble to the top. Focus on the 20% of issues that will give 80% of optimization.

In this talk we will focus on the unique challenges we face on high concurrency and applications requiring low latency.

Laying the foundations for OLTP Performance


The hardware plays a big role. It is critical to understand the theoretical capabilities of the systems in order to succeed.
Understand server architecture (NUMA, PCI layout, etc)
Nehalem-EX: every socket is a NUMA node
How fast is your interconnect? Measure with Sysinternals CoreInfo

Network card tuning is often needed for throughput-intensive workloads
Storage: never go in blind! Knowing only "it's a SAN" will lead to disaster.
Understand and document all components in the path from the server to the disk (HBAs, PCI, network, connectivity on the array, disk configuration, are the resources shared, etc..)
Test the storage before running SQL workload

Upping the Limits


Previously (before 2008R2) Windows was limited to 64 cores
Kernel tuned for this config

With Windows Server 2008R2 this limit is now upped to 256 cores (plumbing for 1024 cores)
New concept: Kernel Groups
A bit like NUMA, but an extra layer in the hierarchy

SQL Server generally follows suit, but for now 256 cores is the limit on R2
Example x64 machines: HP DL980 (64 cores, 128 in HyperThread), IBM 3950 (up to 256 cores)
And the largest IA-64 is 256 hyperthreads (at 128 cores)

The Path to the Sockets


[Diagram: Windows OS kernel groups mapped onto hardware. Kernel Groups 0-3 each contain eight NUMA nodes (NUMA 0-31); each NUMA node is a CPU socket, and each socket holds CPU cores with two hyperthreads (HT) apiece.]

SQL Server Today: Capabilities and Challenges with Real Customer Workloads

Case Study: Large Healthcare Application


Application: patient care application (workflow, EMR, etc)
Performance:
Sustain 9,500 concurrent application users with acceptable response time & total CPU utilization; 15,000 planned for March/April 2011 with ultimate goal of 25,000+

Workload Characteristics:
6,000-7,000 batches/sec with a read/write ratio of about 80/20
Highly normalized schema, lots of relatively complex queries (heavy on loop joins), heavy use of temporary objects (table valued functions), use of BLOBs, transactional and storage based replication

Hardware/Deployment Configuration (Benchmark):
24 application servers, 12 load generators (LoadRunner)
Database servers: DL980 and IBM 3950 (2 node single SQL Server failover cluster instance)

Case Study: Large Healthcare Application (cont.)

Other Solution Requirements:

Require zero data loss (patient data)
Use synchronous SAN-based replication for DR
This means we have to tolerate some transaction log overhead (3-5ms)

Application connections must run with lowest privileges possible
Application audits all access to patient data
Near real time reporting required (transactional replication used to scale out)

Observation
x64 servers provide >2x per-core processing over previous IA64 CPUs

HealthCare Application - Technical Challenges

Challenge / Consideration-Workaround

Network
10 Gb/s network used; no bottlenecks observed

Concurrency
Observed spikes in CPU at random times during workload
Significant spinlock contention on SOS_CACHESTORE due to frequent re-generation of security tokens
Hotfix provided by SQL Server team
Result: SOS_CACHESTORE contention removed
Spinlock contention on LOCK_HASH due to heavy reading of same rows
This was due to an incorrect parameter being passed in by test workload
Result: LOCK_HASH contention removed, reduced CPU from 100% to 18%

Transaction Log
Synchronous replication at the storage level
Observed 10-30ms log latency; expected 3-5ms
Encountered Virtual Log File fragmentation (DBCC LOGINFO); rebuilt log
Observed overutilization of front end fiber channel ports on array; reconfigured storage, balancing traffic across front end ports
Result: 3-5ms latency

Database and table design/Schema
Schema utilizes hash partitioning to avoid page latch contention on inserts
Requirement for low privileges requires longer code paths in the SQL engine

Monitoring

Heavily utilized Extended Events to diagnose spinlock contention points

Architecture/Hardware
Currently running 16 socket IA64 in production
Benchmark performed on 8 socket x64 Nehalem-EX (64 physical cores)
Hyper-threading to 128 logical cores offered little benefit to this workload
Encountered high NUMA latencies (coreinfo.exe); resolved via firmware updates

NUMA latencies
Sysinternals CoreInfo: http://technet.microsoft.com/en-us/sysinternals/cc835722.aspx
Nehalem-EX: every socket is a NUMA node
How fast is your interconnect?

Log Growth and Virtual Log File Fragmentation


The SQL Server physical transaction log is composed of Virtual Log Files (VLFs)
Each auto-growth/growth event will add additional VLFs
Frequent auto-growths can introduce a large number of VLFs which can have
a negative effect on log performance due to:
1. Overhead of the additional VLFs
2. File system fragmentation

Additional information can be found here


Consider rebuilding log if you find 100s or 1,000s of VLFs
DBCC LOGINFO can be used to report on this (example below)
FileId  FileSize    StartOffset   FSeqNo  Status  Parity  CreateLSN
------  ----------  ------------  ------  ------  ------  --------------------
2       253952      8192          48141   0       64      0
2       427556864   74398826496   0       0       128     22970000047327200649
2       427950080   74826383360   0       0       128     22970000047327200649
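A quick way to act on this guidance is to count the VLFs and, when there are hundreds or thousands, shrink and re-grow the log in large increments; a minimal sketch (database and log file names are assumptions):

```sql
-- Each row returned by DBCC LOGINFO is one VLF; count them for the
-- current database.
DBCC LOGINFO;

-- If the count is in the 100s or 1,000s, rebuild the log: shrink it,
-- then re-grow it in a few large increments.
DBCC SHRINKFILE (MyOltpDb_log, 1024);
ALTER DATABASE MyOltpDb
MODIFY FILE (NAME = MyOltpDb_log, SIZE = 64GB);
```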


Spinlocks
Lightweight synchronization primitives used to protect access to data
structures
Used to protect structures in SQL such as lock hash tables (LOCK_HASH),
security caches (SOS_CACHESTORE) and more
Used when it is expected that resources will be held for a very short duration
Why not yield?
It would be more expensive to yield and context switch than spin to acquire the
resource
Threads accessing the same hash bucket of the table are synchronized (LOCK_HASH)

[Diagram: a thread attempts to obtain a lock (row, page, database, etc.); the Lock Manager maintains a hash table of lock resources, and access to each hash bucket is protected by the LOCK_HASH spinlock.]

Spinlocks Diagnosis

select * from sys.dm_os_spinlock_stats
order by spins desc

These symptoms may indicate spinlock contention:
1. A high number of spins is reported for a particular spinlock type, AND
2. The system is experiencing heavy CPU utilization, AND
3. The system has a high amount of concurrency.

Spinlock Diagnosis Walk Through

Extended events capture the backoff events over a 1 min interval & provide the code paths of the contention (security check related). Not a resolution, but we know where to start.
Much higher CPU with drop in throughput (at this point many SQL threads are spinning).
Confirmed theory via dm_os_spinlock_stats: observe the type with highest spins & backoffs. High backoffs = contention.

Name               Collisions  Spins            Spins_Per_Collision  Backoffs
SOS_CACHESTORE     14,752,117  942,869,471,526  63,914               67,900,620
SOS_SUSPEND_QUEUE  69,267,367  473,760,338,765  6,840                2,167,281
LOCK_HASH          5,765,761   260,885,816,584  45,247               3,739,208
MUTEX              2,802,773   9,767,503,682    3,485                350,997
SOS_SCHEDULER      1,207,007   3,692,845,572    3,060                109,746

Spinlock Walkthrough Extended Events Script

--Get the type value for any given spinlock type
select map_value, map_key, name from sys.dm_xe_map_values
where map_value IN ('SOS_CACHESTORE')

--Create the event session that will capture the callstacks to a bucketizer
create event session spin_lock_backoff on server
add event sqlos.spinlock_backoff (action (package0.callstack)
where
type = 144 --SOS_CACHESTORE
)
add target package0.asynchronous_bucketizer (
set filtering_event_name='sqlos.spinlock_backoff',
source_type=1, source='package0.callstack')
with (MAX_MEMORY=50MB, MEMORY_PARTITION_MODE = PER_NODE)

--Ensure the session was created
select * from sys.dm_xe_sessions
where name = 'spin_lock_backoff'

--Run this section to measure the contention
alter event session spin_lock_backoff on server state=start
--Wait to measure the number of backoffs over a 1 minute period
waitfor delay '00:01:00'

--To view the data
--1. Ensure the sqlservr.pdb is in the same directory as the sqlservr.exe
--2. Enable this trace flag to turn on symbol resolution
DBCC traceon (3656, -1)
--Get the callstacks from the bucketizer target
select event_session_address, target_name, execution_count, cast (target_data as XML)
from sys.dm_xe_session_targets xst
inner join sys.dm_xe_sessions xs on (xst.event_session_address = xs.address)
where xs.name = 'spin_lock_backoff'
--Clean up the session
alter event session spin_lock_backoff on server state=stop

A complete walkthrough of the technique can be found here:
http://sqlcat.com/msdnmirror/archive/2010/05/11/resolving-dtc-related-waits-and-tuning-scalability-of-dtc.aspx

Regeneration of Security Tokens Results in High SOS_CACHESTORE Spins

At random times CPU spikes, then almost all sessions wait on LCK_M_X
Huge increase in number of spins & backoffs associated with SOS_CACHESTORE

Observation: it is counterintuitive to have high wait times (LCK_M_X) correlate with heavy CPU. This is the symptom, not the cause.
Approach: use extended events to profile the code path with the spinlock contention (i.e. where there is a high number of backoffs)
Root cause: regeneration of security tokens exposes contention in code paths for access permission checks
Workaround/problem isolation: run with sysadmin rights
Long term change required: SQL Server fix

Fully Qualified Calls To Stored Procedures

Developer uses "EXEC myproc" instead of "EXEC dbo.myproc"
SQL acquires an exclusive lock LCK_M_X and prepares to compile the procedure; this includes calculating the object ID
dm_exec_requests revealed almost all the sessions were waiting on LCK_M_X to compile a stored procedure
Workaround: make app user DB_Owner
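A minimal sketch of the difference (the procedure name is illustrative):

```sql
-- Without a schema, name resolution is ambiguous per caller and can
-- serialize sessions on a compile lock under high concurrency:
EXEC myproc;

-- Schema-qualifying the call avoids the ambiguous resolution path:
EXEC dbo.myproc;
```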

Case Study: Point of Sale (POS) System


Application: Point of Sale application supporting sales at 8,000 stores
Performance:

Sustain expected peak load of ~230 business transactions (checks) per second

Workload Characteristics:

230 business transactions = ~50,000 batches/sec

Heavy insert into a few tables, periodic range scans of newly added data

Heavy network utilization due to inserts and use of BLOB data

Hardware/Deployment Configuration:
Custom test harness, 12 Load Generators, 5 Application servers
Database servers: HP DL 785

48 Physical cores, 256GB RAM

Case Study: Point of Sale (POS) System (cont.)


Other Solution Requirements:
Mission critical to the business in terms of performance and availability.
Strict uptime requirements.
SQL Server Failover Clustering for local (within datacenter) availability
Storage based replication (EMC SRDF) for disaster recovery
Quick recovery time for failover is a priority.

Observation
Initial tests showed low overall system utilization
Long duration for insert statements
High waits on buffer pages (PAGELATCH_EX/PAGELATCH_SH)
Network bottlenecks once the latch waits were resolved
Recovery times (failure to DB online) after failover under full load were between 45 seconds and 3 minutes for unplanned node failures

POS Benchmark Configuration

[Diagram: benchmark topology. Components: 12 x load drivers (2 proc quad core, x64, 32+ GB memory); 5 x app servers (BL460 blades, 2 proc quad core, 32-bit, 32 GB memory); network switches; transaction DB server (1 x DL785, 8P quad core, 2.3GHz, 256 GB RAM) in an active/active failover cluster; reporting DB server (1 x DL585, 4P dual core, 2.6 GHz, 32 GB RAM); Dell R900s, R805s; Brocade 4900 SAN switches (32 ports active); SAN: EMC CX-960 (240 drives, 15K, 300GB).]

Technical Challenges and Architecting for Extreme OLTP

Challenge / Consideration-Workaround

Network
CPU bottlenecks for network processing were observed and resolved via network tuning (RSS)
Further network optimization was performed by implementing compression in the application
After optimizations we were able to push ~180K packets/sec, approx. 111 MB/sec, through a single 1 Gb/s NIC

Concurrency
Page buffer latch waits were by far the biggest pain point
Hash partitioning was used to scale out the B-trees and eliminate the contention
Some PFS contention for the tables containing LOB data; resolved by placing LOB tables on dedicated filegroups and adding more files

Transaction Log
No log bottlenecks were observed. When cache on the array behaves well, log response times are very low.

Database and table design/Schema
Observed overhead related to PK/FK relationships; insert statements required additional work.
Adding a persisted computed column needed for hash partitioning is an offline operation.
Moving LOB data is an offline operation.

Monitoring
For the latch contention, utilized dm_os_wait_stats, dm_os_waiting_tasks and dm_db_index_operational_stats to identify indexes with most contention

Architecture/Hardware
Be careful about shared components in blade server deployments; this became a bottleneck for our middle tier.

Hot Latches!
We observed very high waits for PAGELATCH_EX
High = more than 1ms; we observed greater than 20 ms
Be careful drawing conclusions just on averages

What are we contending on?
Latch: a lightweight semaphore
Locks are logical (transactional consistency)
Latches are physical (memory consistency)

Because rows are small (many fit a page), multiple threads accessing a single page may compete for one PAGELATCH even if there is no lock blocking

[Diagram: an 8K page holding several rows. One session running INSERT VALUES (298, xxxx) holds the page's EX_LATCH (with an IX page lock); a second session running INSERT VALUES (299, xxxx) must wait on EX_LATCH for the same page, even though the row locks do not conflict.]

Waits & Latches

Dig into details with:
sys.dm_os_wait_stats
sys.dm_os_latch_waits

wait_type         % Wait Time
PAGELATCH_EX      86.4%
PAGELATCH_SH      8.2%
LATCH_SH          1.5%
LATCH_EX          1.0%
LOGMGR_QUEUE      0.9%
CHECKPOINT_QUEUE  0.8%
ASYNC_NETWORK_IO  0.8%
WRITELOG          0.4%

latch_class                       wait_time_ms
ACCESS_METHODS_HOBT_VIRTUAL_ROOT  156,818
LOG_MANAGER                       103,316


Waits & Latches - Server Level

sys.dm_os_wait_stats
select *
, wait_time_ms/waiting_tasks_count [avg_wait_time]
, signal_wait_time_ms/waiting_tasks_count [avg_signal_wait_time]
from sys.dm_os_wait_stats
where wait_time_ms > 0
and wait_type like '%PAGELATCH%'
order by wait_time_ms desc

Waits & Latches - Index Level

sys.dm_db_index_operational_stats
/* latch waits ********************************************/
select top 20
database_id, object_id, index_id, count(partition_number) [num partitions]
,sum(leaf_insert_count) [leaf_insert_count], sum(leaf_delete_count) [leaf_delete_count]
,sum(leaf_update_count) [leaf_update_count]
,sum(singleton_lookup_count) [singleton_lookup_count]
,sum(range_scan_count) [range_scan_count]
,sum(page_latch_wait_in_ms) [page_latch_wait_in_ms], sum(page_latch_wait_count) [page_latch_wait_count]
,sum(page_latch_wait_in_ms) / sum(page_latch_wait_count) [avg_page_latch_wait]
,sum(tree_page_latch_wait_in_ms) [tree_page_latch_wait_ms], sum(tree_page_latch_wait_count) [tree_page_latch_wait_count]
,case when (sum(tree_page_latch_wait_count) = 0) then 0
else sum(tree_page_latch_wait_in_ms) / sum(tree_page_latch_wait_count) end [avg_tree_page_latch_wait]
from sys.dm_db_index_operational_stats (null, null, null, null) os
where page_latch_wait_count > 0
group by database_id, object_id, index_id

Hot Latches - Last Page Insert Contention

Most common for indexes which have monotonically increasing key values (i.e. datetime, identity, etc..)

Our scenario:
Two tables were insert heavy, by far receiving the highest number of inserts
Mainly INSERT; however, there is a background process reading off ranges of the newly added data
And don't forget: we have to obtain latches on the non-leaf B-tree pages as well.
Page latch waits vs. tree page latch waits (sys.dm_db_index_operational_stats)

[Diagram: a B-tree index with tree pages above leaf-level data pages, laid out in logical key order. With a monotonically increasing key, many threads insert into the end of the range, all landing on the last leaf page.]

We call this Last Page Insert Contention
Expect: PAGELATCH_EX/SH waits, and this is the observation

How to Solve INSERT hotspot

Option #1: Hash partition the table
Based on hash of a column (commonly a modulo)
Creates multiple B-trees (each partition is a B-tree)
Threads still insert into the end of a range, but across each partition; round robin between the B-trees creates more resources and less contention

Option #2: Do not use a sequential key
Distribute the inserts all over the B-tree

[Diagram: before - all threads inserting into the end of range of a single B-tree, contention on the last page; after - a hash partitioned table/index with ranges 0-1000, 1001-2000, 2001-3000, 3001-4000, inserts spread round robin across the partitions.]

Hash Partitioning Reference:
http://sqlcat.com/technicalnotes/archive/2009/09/22/resolving-pagelatch-contention-on-highly-concurrent-insert-

Example: Before Hash Partitioning

Latch waits of approximately 36 ms at baseline of 99 checks/sec.

Example: After Hash Partitioning*

Latch waits of approximately 0.6 ms at highest throughput of 249 checks/sec.

*Other optimizations were applied; hash partitioning was responsible for a 2.5x improvement in insert throughput

Table Partitioning Example

--Create the partition function and scheme
CREATE PARTITION FUNCTION [pf_hash16] (tinyint) AS RANGE LEFT FOR VALUES
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)

CREATE PARTITION SCHEME [ps_hash16] AS PARTITION [pf_hash16] ALL TO ( [ALL_DATA] )

--Add the computed column to the existing table (this is an OFFLINE operation if done the simple way)
--Consider using bulk loading techniques to speed it up.
ALTER TABLE [dbo].[Transaction]
ADD [HashValue] AS (CONVERT([tinyint], abs(binary_checksum([uidMessageID])%(16)),(0)))
PERSISTED NOT NULL

--Create the index on the new partitioning scheme
CREATE UNIQUE CLUSTERED INDEX [IX_Transaction_ID]
ON [dbo].[Transaction] ([Transaction_ID], [HashValue])
ON ps_hash16(HashValue)

Note: Requires application changes
Ensure Select/Update/Delete have appropriate partition elimination
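To get partition elimination, queries must include a predicate on the hash column; a minimal sketch, assuming the same hash expression used in the computed column (the @uid parameter is illustrative):

```sql
-- The predicate on HashValue lets the optimizer touch one partition
-- instead of probing all 16 B-trees.
DECLARE @uid UNIQUEIDENTIFIER = NEWID();

SELECT t.*
FROM [dbo].[Transaction] t
WHERE t.[uidMessageID] = @uid
  AND t.[HashValue] = CONVERT(tinyint, ABS(BINARY_CHECKSUM(@uid)) % 16);
```

This is the application change the note above refers to: every hot query path must compute and pass the hash value alongside its normal key.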

Network Cards Rule of Thumb


At scale, network traffic will generate a LOT of interrupts for
the CPU
These must be handled by CPU Cores
Must distribute packets to cores for processing

Tuning a Single NIC Card - POS system

Enable RSS to allow multiple CPUs to process receive indications:
http://www.microsoft.com/whdc/device/network/NDIS_RSS.mspx

The next step was to disable the Base Filtering Service in Windows and explicitly enable TCP Chimney offload.
Turning off Base Filtering Service gave a huge reduction in CPU; may not be suitable for all production environments
Careful with Chimney Offload, as per KB 942861

Before and After Tuning Single NIC

Single 1 Gb/s NIC

1. Before any network changes the workload was CPU bound on CPU0
2. After tuning RSS, disabling Base Filtering Service and explicitly enabling TCP Chimney Offload, CPU time on CPU0 was reduced. The base CPU for RSS successfully moved from CPU0 to another CPU.

To DTC or not to DTC: POS System

Com+ transactional applications are still prevalent today
This results in all database calls enlisting in a DTC transaction
45% performance overhead
Scenario in the lab involved two Resource Managers, MSMQ and SQL:

wait_type               total_wait_time_ms  total_waiting_tasks_count  average_wait_ms
DTC_STATE               5,477,997,934       4,523,019                  1,211
PREEMPTIVE_TRANSIMPORT  2,852,073,282       3,672,147                  776
PREEMPTIVE_DTC_ENLIST   2,718,413,458       3,670,307                  740

Tuning approaches
1. Optimize DTC TM configuration (transparent to app)
2. Remove DTC transactions (requires app changes)
Utilize System.Transactions, which will only promote to DTC if more than one RM is involved
See Lightweight transactions: http://msdn.microsoft.com/en-us/magazine/cc163847.aspx#S5

Optimizing DTC Configuration

By default, application servers use the local TM (MSDTC Coordinator)
Introduces RPC communication between SQL TM and App Server TM
App virtualization layer incurs some delay
Configuring application servers to use a remote coordinator removes the RPC communication
See Mike Ruthruff's paper on SQLCAT.COM:
http://sqlcat.com/msdnmirror/archive/2010/05/11/resolving-dtc-related-waits-and-tuning-scalability-of-dtc.aspx

Recap: Session Objectives and Takeaways

Session Objectives:

Learn about SQL Server capabilities and challenges experienced by some of our
extreme OLTP customer scenarios.
Insight into diagnosing and architecting around issues with Tier-1, mission
critical workloads.

Key Takeaways
SQL Server can meet the needs of many of the most challenging OLTP scenarios
in the world.
There are a number of new challenges when designing for high end OLTP
systems.

Applied Architecture Patterns on the Microsoft Platform

Q &A

2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to
be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.


Agenda

Windows Server 2008R2 and SQL Server 2008R2 improvements


Scale architecture

Customer Requirements

Hardware setup
Transaction log essentials

Getting the code right


Application Server Essentials
Database Design

Tuning Data Modification


UPDATE statements
INSERT statements
Management of LOB data

The problem with NUMA and what to do about it

Final results and Thoughts



Top statistics

Category                                          Metric
Largest single database                           80 TB
Largest table                                     20 TB
Biggest total data, 1 customer                    2.5 PB
Highest writes per second, 1 db                   60,000
Fastest I/O subsystem in production (and in lab)  18 GB/sec (26 GB/sec)
Fastest real time cube                            1 sec latency
Data load for 1 TB                                20 minutes
Largest cube                                      12 TB



Customer Scenarios

Core Banking
Workload: credit card transactions from ATMs and branches
Scale requirements: 10,000 business transactions / sec
Technology: app tier .NET 3.5/WCF, SQL 2008R2, Windows 2008R2
Server: HP Superdome, HP DL785G6

Healthcare System
Workload: sharing patient information across multiple healthcare trusts
Scale requirements: 37,500 concurrent users
Technology: app tier .NET, SQL 2008R2, Windows 2008R2
Server: IBM 3950 and HP DL980

POS
Workload: world record deployment of ISV POS application across 8,000 US stores
Scale requirements: handle peak holiday load of 228 checks/sec
Technology: virtualized app tier Com+, Windows 2003; SQL 2008, Windows 2008
Server: DL785

Network Cards - Rule of Thumb

At scale, network traffic will generate a LOT of interrupts for the CPU
These must be handled by CPU cores
Must distribute packets to cores for processing

Rule of thumb (OLTP): 1 NIC / 16 cores
Watch the DPC activity in Task Manager
In Windows 2003, remove SQL Server (with affinity mask) from the NIC cores

Lab: Network Tuning Approaches

1. Tune configuration options of a single NIC card to provide the maximum throughput.
2. Improve the application code to compress LOB data before sending it to the SQL Server.
3. Team a pair of 1 Gb/s NICs to provide more bandwidth (transparent to the app).
4. Add multiple NICs (better for scale).


SQL Server Configuration Changes

As we increased the number of connections to around 6,000 (users had think time) we started seeing waits on THREADPOOL
Solution: increase sp_configure 'max worker threads'
Probably don't want to go higher than 4096
Gradually increase it; default max is 980
Avoid killing yourself in thread management; the bottleneck is likely somewhere else

Use affinity mask to get rid of SQL Server on cores running NIC traffic

Well tuned, pure play OLTP: no need to consider parallel plans
sp_configure 'max degree of parallelism', 1
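A minimal sketch of the two sp_configure changes above (the worker-thread value is illustrative; raise it gradually as the slide advises):

```sql
-- Both settings are advanced options.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Raise worker threads gradually when THREADPOOL waits appear.
EXEC sp_configure 'max worker threads', 2048;

-- Well tuned, pure-play OLTP: disable parallel plans.
EXEC sp_configure 'max degree of parallelism', 1;
RECONFIGURE;
```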


Designing Highly Scalable OLTP Systems

Getting the Code Right


Things to Double Check


Connection pooling enabled?
How much connection memory are we using?
Monitor perfmon: MSSQL: Memory Manager

Obvious memory or handle leaks?

Check Process counters in perfmon for the .NET app
Server side processes will keep memory unless under pressure

Can the application handle the load?


Call into dummy procedures that do nothing
Check measured application throughput
Typical case: Application breaks before SQL


Remote Calling from WCF


Original client code: Synchronous calls in WCF
Each thread must wait for network latency before proceeding
Around 1ms waiting
Very similar to disk I/O thread will fall asleep
Lots of sleeping threads
Limited to around 50 client simulations per machine

Instead, use IAsyncInterface


Designing Highly Scalable OLTP Systems

Tuning Data Modification


Database Schema - Credit Cards

[Diagram: three tables.
ATM (ID_ATM, ID_Branch, LastTransactionDate, LastTransaction_ID), 10**3 rows; updated via UPDATE .. SET LastTransaction_ID = @ID + 1, LastTransactionDate = GETDATE().
Account (Account_ID, LastUpdateDate, Balance), 10**5 rows; updated via UPDATE .. SET Balance.
Transaction (Transaction_ID, Customer_ID, ATM_ID, Account_ID, TransactionDate, Amount), 10**10 rows; written via INSERT .. VALUES (@amount) and INSERT .. VALUES (-1 * @amount).]

Summary of Concerns

Transaction table is hot
Lots of INSERTs
How to handle ID numbers?
Allocation structures in database

Account table must be transactionally consistent with Transaction
Do I trust the developers to do this?
Cannot release lock until BOTH are in sync
What about latency of round trips for this?

Potentially hot rows in Account
Are some accounts touched more than others?

ATM table has hot rows
Each row on average touched at least ten times per second
E.g. 10**3 rows with 10**4 transactions/sec

Generating a Unique ID
Why won't this work?

CREATE PROCEDURE GetID
  @ID INT OUTPUT,
  @ATM_ID INT
AS
DECLARE @LastTransaction_ID INT

SELECT @LastTransaction_ID = LastTransaction_ID
FROM ATM
WHERE ATM_ID = @ATM_ID

SET @ID = @LastTransaction_ID + 1

UPDATE ATM
SET LastTransaction_ID = @ID
WHERE ATM_ID = @ATM_ID
68

Concurrency is Fun

ATM row: ID_ATM = 13, LastTransaction_ID = 42

Session 1:
SELECT @LastTransaction_ID = LastTransaction_ID
FROM ATM WHERE ATM_ID = 13
(@LastTransaction_ID = 42)

Session 2:
SELECT @LastTransaction_ID = LastTransaction_ID
FROM ATM WHERE ATM_ID = 13
(@LastTransaction_ID = 42)

Session 1:
SET @ID = @LastTransaction_ID + 1
UPDATE ATM SET LastTransaction_ID = @ID
WHERE ATM_ID = 13

Session 2:
SET @ID = @LastTransaction_ID + 1
UPDATE ATM SET LastTransaction_ID = @ID
WHERE ATM_ID = 13

Both sessions compute @ID = 43: a duplicate ID

69

Generating a Unique ID The Right Way

CREATE PROCEDURE GetID
  @ID INT OUTPUT,
  @ATM_ID INT
AS
UPDATE ATM
SET @ID = LastTransaction_ID = LastTransaction_ID + 1
WHERE ATM_ID = @ATM_ID

The read and the increment happen atomically under a single update lock. And it is simple too...

70

Hot rows in ATM


Initial runs with a few hundred ATMs show excessive waits for LCK_M_U
Diagnosed in sys.dm_os_wait_stats
Drilling down to individual locks using sys.dm_tran_locks
Inventive readers may wish to use XEvents
Event objects: sqlserver.lock_acquired and sqlos.wait_info
Bucketize them
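The drill-down above can be sketched with those two DMVs; a minimal example (the LCK_M% filter is illustrative):

```sql
-- Aggregate waits since last restart; look for LCK_M_U bubbling to the top
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'LCK_M%'
ORDER BY wait_time_ms DESC;

-- Drill down to the individual resources being waited on right now
SELECT resource_type, resource_description, request_mode, COUNT(*) AS waiters
FROM sys.dm_tran_locks
WHERE request_status = 'WAIT'
GROUP BY resource_type, resource_description, request_mode
ORDER BY waiters DESC;
```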

As concurrency increases, lock waits keep increasing


While throughput stays constant
Until...

71

Spinning around

[Chart: lg(Spins) and Throughput plotted against Requests (0 to 100,000). As concurrency increases, spins climb by orders of magnitude (toward 10^14) while throughput levels off around 18,000-20,000.]

Diagnosed using sys.dm_os_spinlock_stats
Pre-SQL 2008 this was DBCC SQLPERF('spinlockstats')
Can dig deeper using XEvents with the sqlos.spinlock_backoff event
We are spinning for LOCK_HASH
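A quick look at the spinlock DMV surfaces LOCK_HASH when it dominates:

```sql
-- Top spinlocks by spins since last restart (SQL 2008+)
SELECT name, collisions, spins, spins_per_collision, backoffs
FROM sys.dm_os_spinlock_stats
ORDER BY spins DESC;
```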
72

LOCK_HASH what is it?

[Diagram: more and more threads spinning on the LOCK_HASH spinlock that protects the Lock Manager's hash bucket for a row; every LCK_U acquire/release on that row must pass through it.]

Why not go to sleep?

73

Locking at Scale
Ratio between ATM machines and transactions generated was too low.
Can only sustain a limited number of locks/unlocks per second on a single row
Depends a LOT on NUMA hardware, memory speeds and CPU caches
Each ATM was generating 200 transactions/sec in the test harness

Solution: Increase the number of ATM machines

Key Takeaway: If a locked resource is contended, create more of it
Notice: This is not SQL Server specific; any piece of code will be bound by memory speeds when access to a region must be serialized

74

Hot rows in Account

Three ways to update the Account table:
1) Let application servers invoke a transaction to both INSERT into TRANSACTION and UPDATE Account
2) Set a trigger on TRANSACTION
3) Create a stored proc that handles the entire transaction

Option 1 has two issues:
App developers may forget it in some code paths
Latency of the roundtrip: around 1ms, i.e. no more than ~1000 locks/sec possible on a single row

Option 2 is the better choice!

Option 3 must be used in all places in the app to be better than option 2.
76

Hot Latches!

LCK waits are gone, but we are seeing very high waits for PAGELATCH_EX
High = more than 1ms
What are we contending on?

Latch: a lightweight semaphore
Locks are logical (transactional consistency)
Latches are internal to the SQL engine (memory consistency)

Because rows are small (many fit on a page), multiple locks may compete for one PAGELATCH

[Diagram: an 8K page holding several rows; separate LCK_U row locks are held concurrently, but all contend for the page's single PAGELATCH_EX.]

77

Row Padding

In the case of the ATM table, our rows are small and few

We can waste a bit of space to get more performance

Solution: Pad rows with a CHAR column to make each row take a full page

1 LCK = 1 PAGELATCH

ALTER TABLE ATM
ADD Padding CHAR(5000) NOT NULL
DEFAULT ('X')

[Diagram: one row plus the CHAR(5000) padding fills the 8K page, so each PAGELATCH_EX now protects exactly one LCK_U.]

78

INSERT throughput
Transaction table is by far the most active table
Fortunately, only INSERTs
No need to lock rows
But several rows must still fit on a single page

Cannot pad pages: there are 10**10 rows in the table

A new page will eventually be allocated, but until it is, every insert goes to the same page

Expect: PAGELATCH_EX waits
And this is the observation
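One way to see the hot page live (a sketch; the dbid:fileid:pageid triple in resource_description can then be examined with DBCC PAGE):

```sql
-- Who is waiting on page latches right now, and on which page?
SELECT wt.wait_type, wt.resource_description, COUNT(*) AS waiting_tasks
FROM sys.dm_os_waiting_tasks AS wt
WHERE wt.wait_type LIKE 'PAGELATCH%'
GROUP BY wt.wait_type, wt.resource_description
ORDER BY waiting_tasks DESC;
```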

79

Hot page at the end of B-tree with increasing index

80

Waits & Latches

Dig into details with:
sys.dm_os_wait_stats
sys.dm_os_latch_stats

wait_type          % Wait Time
PAGELATCH_SH       86.4%
PAGELATCH_EX       8.2%
LATCH_SH           1.5%
LATCH_EX           1.0%
LOGMGR_QUEUE       0.9%
CHECKPOINT_QUEUE   0.8%
ASYNC_NETWORK_IO   0.8%
WRITELOG           0.4%

latch_class                        wait_time_ms
ACCESS_METHODS_HOBT_VIRTUAL_ROOT   156,818
LOG_MANAGER                        103,316

81

How to Solve INSERT hotspot

Hash partition the table
Create multiple B-trees
Round robin between the B-trees: create more resources and less contention

Do not use a sequential key
Distribute the inserts all over the B-tree

[Diagram: INSERTs are routed by hash(ID) across 8 partitions (hash values 0-7, e.g. IDs 0,8,16 / 1,9,17 / 2,10,18 / 3,11,19 ...), instead of all landing on the last page of a single B-tree covering ranges 0-1000, 1001-2000, 2001-3000, 3001-4000.]
82

Design Pattern: Table Hash Partitioning

Add a hash column to the table (tinyint or smallint)
Calculate a good hash distribution
For example, use HASHBYTES with modulo, or BINARY_CHECKSUM

Use the CREATE PARTITION FUNCTION command
Partition the table into #cores partitions (hash values 0..255)

Use the CREATE PARTITION SCHEME command
Bind the partition function to filegroups
Create a new filegroup, or use existing ones, to hold the partitions
Equally balance over LUNs using an optimal layout
83
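The pattern above might be sketched like this for an 8-core box; the object names and the modulo-8 hash are illustrative, not from the lab:

```sql
-- Hash partition function/scheme: one partition per core (8 here)
CREATE PARTITION FUNCTION pf_hash (tinyint)
AS RANGE LEFT FOR VALUES (0, 1, 2, 3, 4, 5, 6);

CREATE PARTITION SCHEME ps_hash
AS PARTITION pf_hash ALL TO ([PRIMARY]);  -- or spread over multiple filegroups

-- Persisted computed hash column spreads inserts over the partitions
ALTER TABLE [Transaction]
ADD HashKey AS CAST(Transaction_ID % 8 AS tinyint) PERSISTED NOT NULL;

-- Cluster on (HashKey, Transaction_ID), placed on the partition scheme
CREATE CLUSTERED INDEX cix_Transaction
ON [Transaction] (HashKey, Transaction_ID)
ON ps_hash (HashKey);
```

With this layout, concurrent inserts of sequential IDs land on eight different last pages instead of one.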

Lab Example: Before Partitioning

[Screenshot: latch waits of approximately 36 ms at a baseline of 99 checks/sec.]

84

Lab Example: After Partitioning*

[Screenshot: latch waits of approximately 0.6 ms at the highest throughput of 249 checks/sec.]

*Other optimizations were applied

85

B-Tree Root Split

[Diagram: during a page split the SH latches taken while traversing down the B-tree are escalated to EX, the virtual root is protected by the ACCESS_METHODS_HOBT_VIRTUAL_ROOT latch, and the leaf pages' Prev/Next pointers are relinked under EX page latches while the row lock is held.]
87

NUMA and What to do

Remember those PAGELATCH waits for UPDATE statements?
Our solution: add more pages
Improvement: get out of the PAGELATCH fast so the next one can work on it
On NUMA systems, going to a foreign memory node is at least 4-10 times more expensive

Use the SysInternals Coreinfo tool to map your NUMA topology

89

How does NUMA work in SQL Server?

The first NUMA node to request a page will own that page
Ownership continues until the page is evicted from the buffer pool
Every other NUMA node that needs that page will have to do foreign memory access

Additional (SQL 2008) feature is the SuperLatch
Useful when a page is read a lot but written rarely
Only kicks in on 32 cores or more
The "this page is latched" information is copied to all NUMA nodes
Acquiring a PAGELATCH_SH then only requires local NUMA access
But: acquiring a PAGELATCH_EX must signal all NUMA nodes
Perfmon object: MSSQL:Latches
Number of SuperLatches
SuperLatch demotions / sec
SuperLatch promotions / sec

See the CSS blog post
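The perfmon counters above can also be pulled from T-SQL; a minimal sketch (the LIKE filters keep it instance-name agnostic):

```sql
SELECT counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Latches%'
  AND counter_name LIKE '%SuperLatch%';
```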


90

Effect of UPDATE on NUMA traffic

[Diagram: app servers issue UPDATE ATM SET LastTransaction_ID ... over arbitrary connections, so a given ATM_ID's page is latched from NUMA nodes 0-3 in turn, generating constant foreign memory access.]

91

Using NUMA affinity

[Diagram: each app server connects to a dedicated port (8000-8003), each port affinitized to one NUMA node, so every ATM_ID's page is only ever touched from a single node.]

How to: Map TCP/IP Ports to NUMA Nodes
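In SQL Server Configuration Manager the mapping is done by appending a node-affinity bitmask in brackets after each TCP port; the ports below are illustrative values for the four-node layout above:

```
TCP Port: 8000[0x1],8001[0x2],8002[0x4],8003[0x8]
```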


92

Final Results and Thoughts

120,000 Batch Requests / sec
100,000 SQL Transactions / sec
50,000 SQL Write Transactions / sec
12,500 Business Transactions / sec
CPU load: 34 CPU cores busy
Given more time, we would get the CPUs to 100%, tune the NICs more, and work on balancing NUMA more.

Speaking of NICs: we only had two, and they were loading two CPUs at 100%

93

Case Study: Online Gaming

Application: Online sports betting, poker and casino play.
Performance:
15 million page views, 980,000 users per day
Over 30 thousand database transactions per second, 500+ million per day
450,000 SQL statements/sec on a single database

Workload Characteristics:
Multiple systems comprise the gaming experience, including payment, casino games, sportsbook, etc.
Require very low latency and must meet high transaction volumes, based on the number of users on the system
Over 100 SQL Server instances and 1,400 databases in the architecture

Hardware/Deployment Configuration:
Scale-up for the payment system: HP Superdome (32-socket, 2-core; 256GB). Investigating x64
Co-operative scale-out for actual gaming activity

Case Study: Online Gaming (cont.)

Other Solution Requirements:
Failure is not an option
Zero data loss and achieved 99.998% availability
Use database mirroring and log shipping across datacenters to achieve HA/DR goals
http://sqlcat.com/whitepapers/archive/2010/06/07/proven-sql-server-architectures-for-high-availability-and-disaster-recovery.aspx
http://sqlcat.com/whitepapers/archive/2010/11/03/failure-is-not-an-option-zero-data-loss-and-high-availability.aspx
Use SQL Server replication for reporting

Observation:
Large scale of users, with low latency requirements
Hot spots on heavily hit tables: page latching
Scale-out helped increase transaction volume (#/sec)

Online Gaming Infrastructure
(no HA, DR & Backup shown)

This is not a WinMo7, it's a SuperDome

[Diagram: SQL Server instance counts per subsystem. Gaming: Betcache 4+, Casino 2+, VS Games 2+, 1x2 Games 12+, CMS 15+, Newsletter 2+, Other 30+. User Account & Sportsbook 8+, Bookmaking 2+, BGI, CSM 2+, Other 40+. Payment 20+ (with replication). ASP.NET Sessions 8+, SMS 4+, Other 20+. DWH Stage 50+, DWH 60+, OLAP 10+, Monitoring 10+, Administration 20+. Internal Office, SharePoint (300+).]

Technical Challenges and Architecting for Extreme OLTP

Challenge: Network
Consideration/Workaround:
CPU bottlenecks for network processing were observed and resolved via network tuning (RSS)
Dedicated networks for backup, replication, etc.
8 network cards for clients

Challenge: Concurrency
Consideration/Workaround:
Latch contention on heavily used tables, last-page insert
The hash partition option caused other problems in query performance and application design
Resolution: Co-operative scale-out

Challenge: Transaction Log
Consideration/Workaround:
Latency on log writes
Resolution: Increased throughput/decreased latency by placing the transaction log on SSDs
Database mirroring overhead very significant on synchronous

Challenge: Database and table design/Schema
Consideration/Workaround:
Latency on IO-intensive data files (including Tempdb)
Resolution: Session state database on SSDs
Resolution: Betting slips/customer databases testing sharding
Single server, single database: 500 tx/sec
Single server, 4 databases: 1,800 tx/sec (sharding)
Multiple servers: 2,600 tx/sec (sharding)

Challenge: Monitoring
Consideration/Workaround:
Security monitoring (PCI and intrusion detection): 10%-25% impact/overhead when monitoring

Challenge: Architecture/Hardware
Consideration/Workaround:
Tests using x64 (8-socket, 8-core) vs. Itanium Superdome (32-socket, dual-core)
Same transaction throughput
IO and backups were a bit slower

Case Study: Financial Stock Market

Application: Real-time, high-transaction, low-latency stock quoting
Performance:
Over 280,000 business transactions/sec
Over 1 million data manipulation calls per second in a single database
Latency per business transaction under 1 millisecond
Real-time nature of data flow

Workload Characteristics:
Send a large batch containing multiple business transactions; parse it and insert all records into a large table which is constantly read
Under 20 tables in the application. Nothing generic about the code/solution

Hardware/Deployment Configuration:
Load distributed based on an alphabetical split
Co-operative scale-out. Commodity hardware (2-socket, quad-core pre-Nehalem) and a high-performance SAN

Case Study: Financial Stock Market (cont.)

Other Solution Requirements:
Mission critical to the business in terms of performance and availability
Require 99.999% uptime overall and 100% during the business day
Treat the system like their mainframe operations
Utilize SQL Server HA features to help support the 5 9s uptime requirement and geographical redundancy
SQL Server Failover Clustering for local (within datacenter) availability
Database Mirroring (High Availability/Async) for geo-availability
Locations around 300 miles apart
30MB/sec log generation with no send queue

Observation:
Extreme low latency and high throughput requirements with machine-born data led to hitting a number of the same bottlenecks we observed more commonly in the scale-up scenarios.

Stock Market Architecture (1)

Stock Market Architecture (2)

Technical Challenges and Architecting for Extreme OLTP

Challenge: Network
Consideration/Workaround:
Network round-trip time for synchronous calls from the client induced latency
Resolution: Batch data into a single large parameter (varchar(8000)) to avoid network roundtrips

Challenge: Concurrency
Consideration/Workaround:
Page latch contention, small table: latching on 36 rows on a single page
Resolution: Pad the rows to spread the latching over multiple pages; Performance Gain: 20%
Page latch contention, large table: concurrent INSERTs into an incremental (identity) column, last-page insert
Resolution: Clustered index on (partition_id, identity) columns; Performance Gain: 30%
Heavy, long-running threads contending for time on the scheduler
Resolution: Map TCP/IP ports to NUMA nodes (http://msdn.microsoft.com/en-us/library/ms345346.aspx); Performance Gain: 20%

Challenge: Transaction Log
Consideration/Workaround:
Log waits
Resolution: Batch business transactions within a single COMMIT to avoid WRITELOG waits
Test of SSDs for the log helped with latency

Challenge: Database and table design/Schema
Consideration/Workaround:
Change decimal datatypes to money, others to int
Integer-based datatypes go through an optimized code path; Performance Gain: 10%
No RI, as this has an overhead on performance. Enforced in the application.

Challenge: Monitoring
Consideration/Workaround:
5% overhead from running the default trace alone
Collect perfmon and targeted DMV/XEvents output to a repository