
High Scale OLTP

Lessons Learned from SQLCAT Performance Labs

Ewan Fairweather
Program Manager, Microsoft

Session Objectives and Takeaways


Session Objectives:
Learn about SQL Server capabilities and challenges experienced by some of our
extreme OLTP customer scenarios.
Insight into diagnosing and architecting around issues with Tier-1, mission
critical workloads.

Key Takeaways
SQL Server can meet the needs of many of the most challenging OLTP scenarios
in the world.
There are a number of new challenges when designing for high end OLTP
systems.

Laying the foundation and tuning for OLTP

Laying the foundation and tuning for OLTP workloads:


Understand goals and attributes of workload
Performance requirements
Machine born data vs. User driven solution
Read-Write ratio
HA/DR requirements which may have an impact

Apply Configuration and Best Practices guidance


Database and data file considerations
Transaction Log sizing and placement
Configuring the SQL Server Tempdb Database
Optimizing memory configuration

Be familiar with common performance methodologies, toolsets and common OLTP / scaling performance pain points
Know your environment: understanding the hardware is key

Database Files
# should be at least 25% of CPU cores
This alleviates PFS contention (PAGELATCH_UP)
There is no significant point of diminishing returns up to 100% of CPU cores
But manageability is an issue...
Though Windows 2008R2 makes this much easier

TempDb
PFS contention is a larger problem here as it's an instance-wide resource
Deallocations and allocations, RCSI version store, triggers, temp tables
# files should be exactly 100% of CPU threads
Presize at 2 x physical memory

Data files and TempDb on same LUNs

It's all random anyway; don't sub-optimize
IOPS is a global resource for the machine. Goal is to avoid PAGEIOLATCH on any data file

Key Takeaway: Script it! At this scale, manual work WILL drive you
insane
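The takeaway above can be put into practice with a short script; a minimal sketch, assuming an 8-thread box and a hypothetical T: drive for tempdb (file names, sizes, and the path are illustrative assumptions, not the deck's own script):

```sql
-- Add one tempdb data file per CPU thread (8 here), presized so autogrow
-- never fires during the workload.
ALTER DATABASE tempdb
MODIFY FILE (NAME = tempdev, SIZE = 8GB, FILEGROWTH = 0);

ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2, FILENAME = 'T:\tempdb2.ndf', SIZE = 8GB, FILEGROWTH = 0);
-- ...repeat for tempdev3 .. tempdev8 so the file count matches CPU threads.
```

At this scale, generating these statements from a loop or template keeps the file layout consistent across instances.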

Special Consideration: Transaction Log


Transaction log is a set of 127 linked buffers with max 32 outstanding IOs
Each buffer is 60KB
Multiple transactions can fit in one buffer
BUT: buffer must flush before log manager can signal a commit OK

Pre-allocate log file


Use DBCC LOGINFO for existing systems
Example: transaction log throughput was ~80MB/sec
But we consistently got <1ms latency, no spikes!
Initial setup: 2 x HBA on dedicated storage port on RAID10 with 4+4
When tuning for peak: SSD on internal PCI bus (latency: a few µs)

Key Takeaway: For transaction log, dedicate storage components and optimize for low latency
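Pre-allocating the log as recommended above avoids autogrow-driven VLF sprawl; a minimal sketch (the database and file names are illustrative assumptions):

```sql
-- Pre-size the log in one operation rather than relying on many small
-- autogrows, which would create a large number of VLFs.
ALTER DATABASE MyOltpDb
MODIFY FILE (NAME = MyOltpDb_log, SIZE = 64GB, FILEGROWTH = 4GB);

-- Inspect the resulting VLF layout on an existing system.
DBCC LOGINFO ('MyOltpDb');
```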

SQL Server Memory Setup


For large CPU/memory boxes, Lock Pages in Memory really matters
We saw more than double performance
Use gpedit.msc to grant it to the SQL Service account

Consider TF834 (Large page Allocations)


On Windows 2008R2 previous issues with this TF are fixed
Around 5-10% throughput increase
Increases startup time
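TF834 is a startup-only trace flag; a minimal sketch of verifying it after a restart (it cannot be turned on at runtime with DBCC TRACEON):

```sql
-- Add -T834 to the SQL Server startup parameters, restart the service,
-- then confirm it is active:
DBCC TRACESTATUS (834, -1);
-- A Status of 1 for TraceFlag 834 confirms large page allocations are in use.
```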

Beware of NUMA node memory distribution


Set max memory close to box max if dedicated box available

How we think about tuning


Let the workload access patterns guide you
Observe resource consumption and resource waits
http://sqlcat.com/whitepapers/archive/2007/11/19/sql-server-2005-waits-and-queues.aspx
http://sqlcat.com/whitepapers/archive/2009/04/14/troubleshooting-performance-problems-in-sql-server-2008.aspx

Standard tuning always applies (indexes, TSQL, etc)


On these systems we always watch for concurrency related
bottlenecks and key components which affect throughput
Locking, latching, spinlocks, log latency, etc.

*Focus of tuning depends on the workload; foundation areas can bubble to the top. Focus on the 20% of issues that will give 80% of optimization.

In this talk we will focus on the unique challenges we face on high concurrency and applications requiring low latency.

Laying the foundations for OLTP Performance


The hardware plays a big role. It is critical to understand the theoretical capabilities of the systems in order to succeed.
Understand server architecture (NUMA, PCI layout, etc)
Nehalem-EX: every socket is a NUMA node
How fast is your interconnect? Measure with Sysinternals CoreInfo

Network card tuning is often needed for throughput-intensive workloads
Storage: never go in blind! Knowing only "it's a SAN" will lead to disaster.
Understand and document all components in the path from the server to the disk (HBAs, PCI, network, connectivity on the array, disk configuration, are the resources shared, etc..)
Test the storage before running SQL workload

Upping the Limits


Previously (before 2008R2) Windows was limited to 64 cores
Kernel tuned for this config

With Windows Server 2008R2 this limit is now upped to 256 cores (plumbing for 1024 cores)
New concept: Kernel Groups
A bit like NUMA, but an extra layer in the hierarchy

SQL Server generally follows suit, but for now 256 cores is the limit on R2
Example x64 machines: HP DL980 (64 cores, 128 in HyperThread), IBM 3950 (up to 256 cores)
And the largest IA-64 is 256 hyperthreads (at 128 cores)

The Path to the Sockets


[Diagram: Windows OS kernel groups mapped onto hardware. Kernel Groups 0-3 each contain eight NUMA nodes (NUMA 0-31); each NUMA node is a CPU socket, and each socket holds CPU cores with two hyperthreads (HT) apiece.]

SQL Server Today: Capabilities and Challenges with Real Customer Workloads

Case Study: Large Healthcare Application


Application: patient care application (workflow, EMR, etc)
Performance:
Sustain 9,500 concurrent application users with acceptable response time & total CPU utilization; 15,000 planned for March/April 2011 with ultimate goal of 25,000+

Workload Characteristics:
6,000-7,000 batches/sec with a read/write ratio of about 80/20
Highly normalized schema, lots of relatively complex queries (heavy on loop joins), heavy use of temporary objects (table valued functions), use of BLOBs, transactional and storage based replication

Hardware/Deployment Configuration (Benchmark):
24 application servers, 12 load generators (LoadRunner)
Database servers: DL980 and IBM 3950 (2 node single SQL Server failover cluster instance)

Case Study: Large Healthcare Application (cont.)

Other Solution Requirements:

Require zero data loss (patient data)
Use synchronous SAN-based replication for DR
This means we have to tolerate some transaction log overhead (3-5ms)

Application connections must run with lowest privileges possible
Application audits all access to patient data
Near real time reporting required (transactional replication used to scale out)

Observation
x64 servers provide >2x per-core processing over previous IA64 CPUs

HealthCare Application - Technical Challenges

Challenge / Consideration-Workaround

Network
10 Gb/s network used; no bottlenecks observed

Concurrency
Observed spikes in CPU at random times during workload
Significant spinlock contention on SOS_CACHESTORE due to frequent re-generation of security tokens
Hotfix provided by SQL Server team
Result: SOS_CACHESTORE contention removed
Spinlock contention on LOCK_HASH due to heavy reading of same rows
This was due to an incorrect parameter being passed in by test workload
Result: LOCK_HASH contention removed, reduced CPU from 100% to 18%

Transaction Log
Synchronous replication at the storage level
Observed 10-30ms log latency; expected 3-5ms
Encountered Virtual Log File fragmentation (DBCC LOGINFO); rebuilt log
Observed overutilization of front end fiber channel ports on array; reconfigured storage, balancing traffic across front end ports
Result: 3-5ms latency

Database and table design/Schema
Schema utilizes hash partitioning to avoid page latch contention on inserts
Requirement for low privileges requires longer code paths in the SQL engine

Monitoring

Heavily utilized Extended Events to diagnose spinlock contention points

Architecture/Hardware
Currently running 16 socket IA64 in production
Benchmark performed on 8 socket x64 Nehalem-EX (64 physical cores)
Hyper-threading to 128 logical cores offered little benefit to this workload
Encountered high NUMA latencies (coreinfo.exe); resolved via firmware updates

NUMA latencies
Sysinternals CoreInfo: http://technet.microsoft.com/en-us/sysinternals/cc835722.aspx
Nehalem-EX: every socket is a NUMA node
How fast is your interconnect?

Log Growth and Virtual Log File Fragmentation


The SQL Server physical transaction log is composed of Virtual Log Files (VLFs)
Each auto-growth/growth event will add additional VLFs
Frequent auto-growths can introduce a large number of VLFs which can have
a negative effect on log performance due to:
1. Overhead of the additional VLFs
2. File system fragmentation

Additional information can be found here


Consider rebuilding log if you find 100s or 1,000s of VLFs
DBCC LOGINFO can be used to report on this (example below)
FileId  FileSize    StartOffset   FSeqNo  Status  Parity  CreateLSN
------  ----------  ------------  ------  ------  ------  --------------------
2       253952      8192          48141   0       64      0
2       427556864   74398826496   0       0       128     22970000047327200649
2       427950080   74826383360   0       0       128     22970000047327200649
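A quick way to act on this guidance is to count the VLFs and, when there are hundreds or thousands, shrink and re-grow the log in large increments; a minimal sketch (database and log file names are assumptions):

```sql
-- Each row returned by DBCC LOGINFO is one VLF; count them for the
-- current database.
DBCC LOGINFO;

-- If the count is in the 100s or 1,000s, rebuild the log: shrink it,
-- then re-grow it in a few large increments.
DBCC SHRINKFILE (MyOltpDb_log, 1024);
ALTER DATABASE MyOltpDb
MODIFY FILE (NAME = MyOltpDb_log, SIZE = 64GB);
```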


Spinlocks
Lightweight synchronization primitives used to protect access to data
structures
Used to protect structures in SQL such as lock hash tables (LOCK_HASH),
security caches (SOS_CACHESTORE) and more
Used when it is expected that resources will be held for a very short duration
Why not yield?
It would be more expensive to yield and context switch than spin to acquire the
resource
Threads accessing the same hash bucket of the table are synchronized (LOCK_HASH)

[Diagram: a thread attempts to obtain a lock (row, page, database, etc.); the Lock Manager maintains a hash table of lock resources, and access to each hash bucket is protected by the LOCK_HASH spinlock.]

Spinlocks Diagnosis

select * from sys.dm_os_spinlock_stats
order by spins desc

These symptoms may indicate spinlock contention:
1. A high number of spins is reported for a particular spinlock type, AND
2. The system is experiencing heavy CPU utilization, AND
3. The system has a high amount of concurrency.

Spinlock Diagnosis Walk Through

Extended events capture the backoff events over a 1 min interval & provide the code paths of the contention (security check related). Not a resolution, but we know where to start.
Much higher CPU with drop in throughput (at this point many SQL threads are spinning).
Confirmed theory via dm_os_spinlock_stats: observe the type with highest spins & backoffs. High backoffs = contention.

Name               Collisions  Spins            Spins_Per_Collision  Backoffs
SOS_CACHESTORE     14,752,117  942,869,471,526  63,914               67,900,620
SOS_SUSPEND_QUEUE  69,267,367  473,760,338,765  6,840                2,167,281
LOCK_HASH          5,765,761   260,885,816,584  45,247               3,739,208
MUTEX              2,802,773   9,767,503,682    3,485                350,997
SOS_SCHEDULER      1,207,007   3,692,845,572    3,060                109,746

Spinlock Walkthrough Extended Events Script

--Get the type value for any given spinlock type
select map_value, map_key, name from sys.dm_xe_map_values
where map_value IN ('SOS_CACHESTORE')

--Create the event session that will capture the callstacks to a bucketizer
create event session spin_lock_backoff on server
add event sqlos.spinlock_backoff (action (package0.callstack)
where
type = 144 --SOS_CACHESTORE
)
add target package0.asynchronous_bucketizer (
set filtering_event_name='sqlos.spinlock_backoff',
source_type=1, source='package0.callstack')
with (MAX_MEMORY=50MB, MEMORY_PARTITION_MODE = PER_NODE)

--Ensure the session was created
select * from sys.dm_xe_sessions
where name = 'spin_lock_backoff'

--Run this section to measure the contention
alter event session spin_lock_backoff on server state=start
--Wait to measure the number of backoffs over a 1 minute period
waitfor delay '00:01:00'

--To view the data
--1. Ensure the sqlservr.pdb is in the same directory as the sqlservr.exe
--2. Enable this trace flag to turn on symbol resolution
DBCC traceon (3656, -1)
--Get the callstacks from the bucketizer target
select event_session_address, target_name, execution_count, cast (target_data as XML)
from sys.dm_xe_session_targets xst
inner join sys.dm_xe_sessions xs on (xst.event_session_address = xs.address)
where xs.name = 'spin_lock_backoff'
--Clean up the session
alter event session spin_lock_backoff on server state=stop

A complete walkthrough of the technique can be found here:
http://sqlcat.com/msdnmirror/archive/2010/05/11/resolving-dtc-related-waits-and-tuning-scalability-of-dtc.aspx

Regeneration of Security Tokens Results in High SOS_CACHESTORE Spins

At random times CPU spikes, then almost all sessions wait on LCK_M_X
Huge increase in number of spins & backoffs associated with SOS_CACHESTORE

Observation: it is counterintuitive to have high wait times (LCK_M_X) correlate with heavy CPU. This is the symptom, not the cause.
Approach: use extended events to profile the code path with the spinlock contention (i.e. where there is a high number of backoffs)
Root cause: regeneration of security tokens exposes contention in code paths for access permission checks
Workaround/problem isolation: run with sysadmin rights
Long term change required: SQL Server fix

Fully Qualified Calls To Stored Procedures

Developer uses "EXEC myproc" instead of "EXEC dbo.myproc"
SQL acquires an exclusive lock LCK_M_X and prepares to compile the procedure; this includes calculating the object ID
dm_exec_requests revealed almost all the sessions were waiting on LCK_M_X to compile a stored procedure
Workaround: make app user DB_Owner
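A minimal sketch of the difference (the procedure name is illustrative):

```sql
-- Without a schema, name resolution is ambiguous per caller and can
-- serialize sessions on a compile lock under high concurrency:
EXEC myproc;

-- Schema-qualifying the call avoids the ambiguous resolution path:
EXEC dbo.myproc;
```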

Case Study: Point of Sale (POS) System


Application: Point of Sale application supporting sales at 8,000 stores
Performance:

Sustain expected peak load of ~230 business transactions (checks) per second

Workload Characteristics:

230 business transactions = ~50,000 batches/sec

Heavy insert into a few tables, periodic range scans of newly added data

Heavy network utilization due to inserts and use of BLOB data

Hardware/Deployment Configuration:
Custom test harness, 12 Load Generators, 5 Application servers
Database servers: HP DL 785

48 Physical cores, 256GB RAM

Case Study: Point of Sale (POS) System (cont.)


Other Solution Requirements:
Mission critical to the business in terms of performance and availability.
Strict uptime requirements.
SQL Server Failover Clustering for local (within datacenter) availability
Storage based replication (EMC SRDF) for disaster recovery
Quick recovery time for failover is a priority.

Observation
Initial tests showed low overall system utilization
Long duration for insert statements
High waits on buffer pages (PAGELATCH_EX/PAGELATCH_SH)
Network bottlenecks once the latch waits were resolved
Recovery times (failure to DB online) after failover under full load were between 45 seconds and 3 minutes for unplanned node failures

POS Benchmark Configuration

[Diagram: benchmark topology. Components: 12 x load drivers (2 proc quad core, x64, 32+ GB memory); 5 x app servers (BL460 blades, 2 proc quad core, 32-bit, 32 GB memory); network switches; transaction DB server (1 x DL785, 8P quad core, 2.3GHz, 256 GB RAM) in an active/active failover cluster; reporting DB server (1 x DL585, 4P dual core, 2.6 GHz, 32 GB RAM); Dell R900s, R805s; Brocade 4900 SAN switches (32 ports active); SAN: EMC CX-960 (240 drives, 15K, 300GB).]

Technical Challenges and Architecting for Extreme OLTP

Challenge / Consideration-Workaround

Network
CPU bottlenecks for network processing were observed and resolved via network tuning (RSS)
Further network optimization was performed by implementing compression in the application
After optimizations we were able to push ~180K packets/sec, approx. 111 MB/sec, through a single 1 Gb/s NIC

Concurrency
Page buffer latch waits were by far the biggest pain point
Hash partitioning was used to scale out the B-trees and eliminate the contention
Some PFS contention for the tables containing LOB data; resolved by placing LOB tables on dedicated filegroups and adding more files

Transaction Log
No log bottlenecks were observed. When cache on the array behaves well, log response times are very low.

Database and table design/Schema
Observed overhead related to PK/FK relationships; insert statements required additional work.
Adding a persisted computed column needed for hash partitioning is an offline operation.
Moving LOB data is an offline operation.

Monitoring
For the latch contention, utilized dm_os_wait_stats, dm_os_waiting_tasks and dm_db_index_operational_stats to identify indexes with most contention

Architecture/Hardware
Be careful about shared components in blade server deployments; this became a bottleneck for our middle tier.

Hot Latches!
We observed very high waits for PAGELATCH_EX
High = more than 1ms; we observed greater than 20 ms
Be careful drawing conclusions just on averages

What are we contending on?
Latch: a lightweight semaphore
Locks are logical (transactional consistency)
Latches are physical (memory consistency)

Because rows are small (many fit a page), multiple threads accessing a single page may compete for one PAGELATCH even if there is no lock blocking

[Diagram: an 8K page holding several rows. One session running INSERT VALUES (298, xxxx) holds the page's EX_LATCH (with an IX page lock); a second session running INSERT VALUES (299, xxxx) must wait on EX_LATCH for the same page, even though the row locks do not conflict.]

Waits & Latches

Dig into details with:
sys.dm_os_wait_stats
sys.dm_os_latch_waits

wait_type         % Wait Time
PAGELATCH_EX      86.4%
PAGELATCH_SH      8.2%
LATCH_SH          1.5%
LATCH_EX          1.0%
LOGMGR_QUEUE      0.9%
CHECKPOINT_QUEUE  0.8%
ASYNC_NETWORK_IO  0.8%
WRITELOG          0.4%

latch_class                       wait_time_ms
ACCESS_METHODS_HOBT_VIRTUAL_ROOT  156,818
LOG_MANAGER                       103,316


Waits & Latches - Server Level

sys.dm_os_wait_stats
select *
, wait_time_ms/waiting_tasks_count [avg_wait_time]
, signal_wait_time_ms/waiting_tasks_count [avg_signal_wait_time]
from sys.dm_os_wait_stats
where wait_time_ms > 0
and wait_type like '%PAGELATCH%'
order by wait_time_ms desc

Waits & Latches - Index Level

sys.dm_db_index_operational_stats
/* latch waits ********************************************/
select top 20
database_id, object_id, index_id, count(partition_number) [num partitions]
,sum(leaf_insert_count) [leaf_insert_count], sum(leaf_delete_count) [leaf_delete_count]
,sum(leaf_update_count) [leaf_update_count]
,sum(singleton_lookup_count) [singleton_lookup_count]
,sum(range_scan_count) [range_scan_count]
,sum(page_latch_wait_in_ms) [page_latch_wait_in_ms], sum(page_latch_wait_count) [page_latch_wait_count]
,sum(page_latch_wait_in_ms) / sum(page_latch_wait_count) [avg_page_latch_wait]
,sum(tree_page_latch_wait_in_ms) [tree_page_latch_wait_ms], sum(tree_page_latch_wait_count) [tree_page_latch_wait_count]
,case when (sum(tree_page_latch_wait_count) = 0) then 0
else sum(tree_page_latch_wait_in_ms) / sum(tree_page_latch_wait_count) end [avg_tree_page_latch_wait]
from sys.dm_db_index_operational_stats (null, null, null, null) os
where page_latch_wait_count > 0
group by database_id, object_id, index_id

Hot Latches - Last Page Insert Contention

Most common for indexes which have monotonically increasing key values (i.e. datetime, identity, etc..)

Our scenario:
Two tables were insert heavy, by far receiving the highest number of inserts
Mainly INSERT; however, there is a background process reading off ranges of the newly added data
And don't forget: we have to obtain latches on the non-leaf B-tree pages as well.
Page latch waits vs. tree page latch waits (sys.dm_db_index_operational_stats)

[Diagram: a B-tree index with tree pages above leaf-level data pages, laid out in logical key order. With a monotonically increasing key, many threads insert into the end of the range, all landing on the last leaf page.]

We call this Last Page Insert Contention
Expect: PAGELATCH_EX/SH waits, and this is the observation

How to Solve INSERT hotspot

Option #1: Hash partition the table
Based on hash of a column (commonly a modulo)
Creates multiple B-trees (each partition is a B-tree)
Threads still insert into the end of a range, but across each partition; round robin between the B-trees creates more resources and less contention

Option #2: Do not use a sequential key
Distribute the inserts all over the B-tree

[Diagram: before - all threads inserting into the end of range of a single B-tree, contention on the last page; after - a hash partitioned table/index with ranges 0-1000, 1001-2000, 2001-3000, 3001-4000, inserts spread round robin across the partitions.]

Hash Partitioning Reference:
http://sqlcat.com/technicalnotes/archive/2009/09/22/resolving-pagelatch-contention-on-highly-concurrent-insert-

Example: Before Hash Partitioning

Latch waits of approximately 36 ms at baseline of 99 checks/sec.

Example: After Hash Partitioning*

Latch waits of approximately 0.6 ms at highest throughput of 249 checks/sec.

*Other optimizations were applied; hash partitioning was responsible for a 2.5x improvement in insert throughput

Table Partitioning Example

--Create the partition function and scheme
CREATE PARTITION FUNCTION [pf_hash16] (tinyint) AS RANGE LEFT FOR VALUES
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)

CREATE PARTITION SCHEME [ps_hash16] AS PARTITION [pf_hash16] ALL TO ( [ALL_DATA] )

--Add the computed column to the existing table (this is an OFFLINE operation if done the simple way)
--Consider using bulk loading techniques to speed it up.
ALTER TABLE [dbo].[Transaction]
ADD [HashValue] AS (CONVERT([tinyint], abs(binary_checksum([uidMessageID])%(16)),(0)))
PERSISTED NOT NULL

--Create the index on the new partitioning scheme
CREATE UNIQUE CLUSTERED INDEX [IX_Transaction_ID]
ON [dbo].[Transaction] ([Transaction_ID], [HashValue])
ON ps_hash16(HashValue)

Note: Requires application changes
Ensure Select/Update/Delete have appropriate partition elimination
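To get partition elimination, queries must include a predicate on the hash column; a minimal sketch, assuming the same hash expression used in the computed column (the @uid parameter is illustrative):

```sql
-- The predicate on HashValue lets the optimizer touch one partition
-- instead of probing all 16 B-trees.
DECLARE @uid UNIQUEIDENTIFIER = NEWID();

SELECT t.*
FROM [dbo].[Transaction] t
WHERE t.[uidMessageID] = @uid
  AND t.[HashValue] = CONVERT(tinyint, ABS(BINARY_CHECKSUM(@uid)) % 16);
```

This is the application change the note above refers to: every hot query path must compute and pass the hash value alongside its normal key.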

Network Cards Rule of Thumb


At scale, network traffic will generate a LOT of interrupts for
the CPU
These must be handled by CPU Cores
Must distribute packets to cores for processing

Tuning a Single NIC Card - POS system

Enable RSS to allow multiple CPUs to process receive indications:
http://www.microsoft.com/whdc/device/network/NDIS_RSS.mspx

The next step was to disable the Base Filtering Service in Windows and explicitly enable TCP Chimney offload.
Turning off Base Filtering Service gave a huge reduction in CPU; may not be suitable for all production environments
Careful with Chimney Offload, as per KB 942861

Before and After Tuning Single NIC

Single 1 Gb/s NIC

1. Before any network changes the workload was CPU bound on CPU0
2. After tuning RSS, disabling Base Filtering Service and explicitly enabling TCP Chimney Offload, CPU time on CPU0 was reduced. The base CPU for RSS successfully moved from CPU0 to another CPU.

To DTC or not to DTC: POS System

Com+ transactional applications are still prevalent today
This results in all database calls enlisting in a DTC transaction
45% performance overhead
Scenario in the lab involved two Resource Managers, MSMQ and SQL:

wait_type               total_wait_time_ms  total_waiting_tasks_count  average_wait_ms
DTC_STATE               5,477,997,934       4,523,019                  1,211
PREEMPTIVE_TRANSIMPORT  2,852,073,282       3,672,147                  776
PREEMPTIVE_DTC_ENLIST   2,718,413,458       3,670,307                  740

Tuning approaches
1. Optimize DTC TM configuration (transparent to app)
2. Remove DTC transactions (requires app changes)
Utilize System.Transactions, which will only promote to DTC if more than one RM is involved
See Lightweight transactions: http://msdn.microsoft.com/en-us/magazine/cc163847.aspx#S5

Optimizing DTC Configuration

By default, application servers use the local TM (MSDTC Coordinator)
Introduces RPC communication between SQL TM and App Server TM
App virtualization layer incurs some delay
Configuring application servers to use a remote coordinator removes the RPC communication
See Mike Ruthruff's paper on SQLCAT.COM:
http://sqlcat.com/msdnmirror/archive/2010/05/11/resolving-dtc-related-waits-and-tuning-scalability-of-dtc.aspx

Recap: Session Objectives and Takeaways

Session Objectives:

Learn about SQL Server capabilities and challenges experienced by some of our
extreme OLTP customer scenarios.
Insight into diagnosing and architecting around issues with Tier-1, mission
critical workloads.

Key Takeaways
SQL Server can meet the needs of many of the most challenging OLTP scenarios
in the world.
There are a number of new challenges when designing for high end OLTP
systems.

Applied Architecture Patterns on the Microsoft Platform

Q &A

2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to
be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.


Agenda

Windows Server 2008R2 and SQL Server 2008R2 improvements


Scale architecture

Customer Requirements

Hardware setup
Transaction log essentials

Getting the code right


Application Server Essentials
Database Design

Tuning Data Modification


UPDATE statements
INSERT statements
Management of LOB data

The problem with NUMA and what to do about it

Final results and Thoughts



Top statistics

Category                                          Metric
Largest single database                           80 TB
Largest table                                     20 TB
Biggest total data, 1 customer                    2.5 PB
Highest writes per second, 1 db                   60,000
Fastest I/O subsystem in production (and in lab)  18 GB/sec (26 GB/sec)
Fastest real time cube                            1 sec latency
Data load for 1 TB                                20 minutes
Largest cube                                      12 TB



Customer Scenarios

Core Banking
Workload: credit card transactions from ATMs and branches
Scale requirements: 10,000 business transactions / sec
Technology: app tier .NET 3.5/WCF, SQL 2008R2, Windows 2008R2
Server: HP Superdome, HP DL785G6

Healthcare System
Workload: sharing patient information across multiple healthcare trusts
Scale requirements: 37,500 concurrent users
Technology: app tier .NET, SQL 2008R2, Windows 2008R2
Server: IBM 3950 and HP DL980

POS
Workload: world record deployment of ISV POS application across 8,000 US stores
Scale requirements: handle peak holiday load of 228 checks/sec
Technology: virtualized app tier Com+, Windows 2003; SQL 2008, Windows 2008
Server: DL785

Network Cards - Rule of Thumb

At scale, network traffic will generate a LOT of interrupts for the CPU
These must be handled by CPU cores
Must distribute packets to cores for processing

Rule of thumb (OLTP): 1 NIC / 16 cores
Watch the DPC activity in Task Manager
In Windows 2003, remove SQL Server (with affinity mask) from the NIC cores

Lab: Network Tuning Approaches

1. Tune configuration options of a single NIC card to provide the maximum throughput.
2. Improve the application code to compress LOB data before sending it to the SQL Server.
3. Team a pair of 1 Gb/s NICs to provide more bandwidth (transparent to the app).
4. Add multiple NICs (better for scale).


SQL Server Configuration Changes

As we increased the number of connections to around 6,000 (users had think time) we started seeing waits on THREADPOOL
Solution: increase sp_configure 'max worker threads'
Probably don't want to go higher than 4096
Gradually increase it; default max is 980
Avoid killing yourself in thread management; the bottleneck is likely somewhere else

Use affinity mask to get rid of SQL Server on cores running NIC traffic

Well tuned, pure play OLTP: no need to consider parallel plans
sp_configure 'max degree of parallelism', 1
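A minimal sketch of the two sp_configure changes above (the worker-thread value is illustrative; raise it gradually as the slide advises):

```sql
-- Both settings are advanced options.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Raise worker threads gradually when THREADPOOL waits appear.
EXEC sp_configure 'max worker threads', 2048;

-- Well tuned, pure-play OLTP: disable parallel plans.
EXEC sp_configure 'max degree of parallelism', 1;
RECONFIGURE;
```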


Designing Highly Scalable OLTP Systems

Getting the Code Right


Things to Double Check


Connection pooling enabled?
How much connection memory are we using?
Monitor perfmon: MSSQL: Memory Manager

Obvious memory or handle leaks?

Check Process counters in perfmon for the .NET app
Server side processes will keep memory unless under pressure

Can the application handle the load?


Call into dummy procedures that do nothing
Check measured application throughput
Typical case: Application breaks before SQL


Remote Calling from WCF


Original client code: Synchronous calls in WCF
Each thread must wait for network latency before proceeding
Around 1ms waiting
Very similar to disk I/O thread will fall asleep
Lots of sleeping threads
Limited to around 50 client simulations per machine

Instead, use IAsyncInterface


Designing Highly Scalable OLTP Systems

Tuning Data Modification


Database Schema - Credit Cards

[Diagram: three tables.
ATM (ID_ATM, ID_Branch, LastTransactionDate, LastTransaction_ID), 10**3 rows; updated via UPDATE .. SET LastTransaction_ID = @ID + 1, LastTransactionDate = GETDATE().
Account (Account_ID, LastUpdateDate, Balance), 10**5 rows; updated via UPDATE .. SET Balance.
Transaction (Transaction_ID, Customer_ID, ATM_ID, Account_ID, TransactionDate, Amount), 10**10 rows; written via INSERT .. VALUES (@amount) and INSERT .. VALUES (-1 * @amount).]

Summary of Concerns

Transaction table is hot
Lots of INSERTs
How to handle ID numbers?
Allocation structures in database

Account table must be transactionally consistent with Transaction
Do I trust the developers to do this?
Cannot release lock until BOTH are in sync
What about latency of round trips for this?

Potentially hot rows in Account
Are some accounts touched more than others?

ATM table has hot rows
Each row on average touched at least ten times per second
E.g. 10**3 rows with 10**4 transactions/sec

Generating a Unique ID
Why won't this work?

CREATE PROCEDURE GetID
  @ID INT OUTPUT,
  @ATM_ID INT
AS
DECLARE @LastTransaction_ID INT

SELECT @LastTransaction_ID = LastTransaction_ID
FROM ATM
WHERE ATM_ID = @ATM_ID

SET @ID = @LastTransaction_ID + 1

UPDATE ATM
SET LastTransaction_ID = @ID
WHERE ATM_ID = @ATM_ID
68

Concurrency is Fun

ATM row: ID_ATM = 13, LastTransaction_ID = 42

Session 1:
SELECT @LastTransaction_ID = LastTransaction_ID
FROM ATM WHERE ATM_ID = 13
(@LastTransaction_ID = 42)

Session 2:
SELECT @LastTransaction_ID = LastTransaction_ID
FROM ATM WHERE ATM_ID = 13
(@LastTransaction_ID = 42)

Session 1:
SET @ID = @LastTransaction_ID + 1
UPDATE ATM SET LastTransaction_ID = @ID
WHERE ATM_ID = 13

Session 2:
SET @ID = @LastTransaction_ID + 1
UPDATE ATM SET LastTransaction_ID = @ID
WHERE ATM_ID = 13

Both sessions compute @ID = 43: a duplicate ID

69

Generating a Unique ID The Right Way

CREATE PROCEDURE GetID
  @ID INT OUTPUT,
  @ATM_ID INT
AS
UPDATE ATM
SET @ID = LastTransaction_ID = LastTransaction_ID + 1
WHERE ATM_ID = @ATM_ID

The read and the increment happen atomically under a single update lock. And it is simple too...

70

Hot rows in ATM


Initial runs with a few hundred ATMs show excessive waits for LCK_M_U
Diagnosed in sys.dm_os_wait_stats
Drilling down to individual locks using sys.dm_tran_locks
Inventive readers may wish to use XEvents
Event objects: sqlserver.lock_acquired and sqlos.wait_info
Bucketize them
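The drill-down above can be sketched with those two DMVs; a minimal example (the LCK_M% filter is illustrative):

```sql
-- Aggregate waits since last restart; look for LCK_M_U bubbling to the top
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'LCK_M%'
ORDER BY wait_time_ms DESC;

-- Drill down to the individual resources being waited on right now
SELECT resource_type, resource_description, request_mode, COUNT(*) AS waiters
FROM sys.dm_tran_locks
WHERE request_status = 'WAIT'
GROUP BY resource_type, resource_description, request_mode
ORDER BY waiters DESC;
```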

As concurrency increases, lock waits keep increasing


While throughput stays constant
Until...

71

Spinning around

[Chart: lg(Spins) and Throughput plotted against Requests (0 to 100,000). As concurrency increases, spins climb by orders of magnitude (toward 10^14) while throughput levels off around 18,000-20,000.]

Diagnosed using sys.dm_os_spinlock_stats
Pre-SQL 2008 this was DBCC SQLPERF('spinlockstats')
Can dig deeper using XEvents with the sqlos.spinlock_backoff event
We are spinning for LOCK_HASH
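A quick look at the spinlock DMV surfaces LOCK_HASH when it dominates:

```sql
-- Top spinlocks by spins since last restart (SQL 2008+)
SELECT name, collisions, spins, spins_per_collision, backoffs
FROM sys.dm_os_spinlock_stats
ORDER BY spins DESC;
```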
72

LOCK_HASH what is it?

[Diagram: more and more threads spinning on the LOCK_HASH spinlock that protects the Lock Manager's hash bucket for a row; every LCK_U acquire/release on that row must pass through it.]

Why not go to sleep?

73

Locking at Scale
Ratio between ATM machines and transactions generated was too low.
Can only sustain a limited number of locks/unlocks per second on a single row
Depends a LOT on NUMA hardware, memory speeds and CPU caches
Each ATM was generating 200 transactions/sec in the test harness

Solution: Increase the number of ATM machines

Key Takeaway: If a locked resource is contended, create more of it
Notice: This is not SQL Server specific; any piece of code will be bound by memory speeds when access to a region must be serialized

74

Hot rows in Account

Three ways to update the Account table:
1) Let application servers invoke a transaction to both INSERT into TRANSACTION and UPDATE Account
2) Set a trigger on TRANSACTION
3) Create a stored proc that handles the entire transaction

Option 1 has two issues:
App developers may forget it in some code paths
Latency of the roundtrip: around 1ms, i.e. no more than ~1000 locks/sec possible on a single row

Option 2 is the better choice!

Option 3 must be used in all places in the app to be better than option 2.
76

Hot Latches!

LCK waits are gone, but we are seeing very high waits for PAGELATCH_EX
High = more than 1ms
What are we contending on?

Latch: a lightweight semaphore
Locks are logical (transactional consistency)
Latches are internal to the SQL engine (memory consistency)

Because rows are small (many fit on a page), multiple locks may compete for one PAGELATCH

[Diagram: an 8K page holding several rows; separate LCK_U row locks are held concurrently, but all contend for the page's single PAGELATCH_EX.]

77

Row Padding

In the case of the ATM table, our rows are small and few

We can waste a bit of space to get more performance

Solution: Pad rows with a CHAR column to make each row take a full page

1 LCK = 1 PAGELATCH

ALTER TABLE ATM
ADD Padding CHAR(5000) NOT NULL
DEFAULT ('X')

[Diagram: one row plus the CHAR(5000) padding fills the 8K page, so each PAGELATCH_EX now protects exactly one LCK_U.]

78

INSERT throughput
Transaction table is by far the most active table
Fortunately, only INSERTs
No need to lock rows
But several rows must still fit on a single page

Cannot pad pages: there are 10**10 rows in the table

A new page will eventually be allocated, but until it is, every insert goes to the same page

Expect: PAGELATCH_EX waits
And this is the observation
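One way to see the hot page live (a sketch; the dbid:fileid:pageid triple in resource_description can then be examined with DBCC PAGE):

```sql
-- Who is waiting on page latches right now, and on which page?
SELECT wt.wait_type, wt.resource_description, COUNT(*) AS waiting_tasks
FROM sys.dm_os_waiting_tasks AS wt
WHERE wt.wait_type LIKE 'PAGELATCH%'
GROUP BY wt.wait_type, wt.resource_description
ORDER BY waiting_tasks DESC;
```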

79

Hot page at the end of B-tree with increasing index

80

Waits & Latches

Dig into details with:
sys.dm_os_wait_stats
sys.dm_os_latch_stats

wait_type          % Wait Time
PAGELATCH_SH       86.4%
PAGELATCH_EX       8.2%
LATCH_SH           1.5%
LATCH_EX           1.0%
LOGMGR_QUEUE       0.9%
CHECKPOINT_QUEUE   0.8%
ASYNC_NETWORK_IO   0.8%
WRITELOG           0.4%

latch_class                        wait_time_ms
ACCESS_METHODS_HOBT_VIRTUAL_ROOT   156,818
LOG_MANAGER                        103,316

81

How to Solve INSERT hotspot

Hash partition the table
Create multiple B-trees
Round robin between the B-trees: create more resources and less contention

Do not use a sequential key
Distribute the inserts all over the B-tree

[Diagram: INSERTs are routed by hash(ID) across 8 partitions (hash values 0-7, e.g. IDs 0,8,16 / 1,9,17 / 2,10,18 / 3,11,19 ...), instead of all landing on the last page of a single B-tree covering ranges 0-1000, 1001-2000, 2001-3000, 3001-4000.]
82

Design Pattern: Table Hash Partitioning

Add a hash column to the table (tinyint or smallint)
Calculate a good hash distribution
For example, use HASHBYTES with modulo, or BINARY_CHECKSUM

Use the CREATE PARTITION FUNCTION command
Partition the table into #cores partitions (hash values 0..255)

Use the CREATE PARTITION SCHEME command
Bind the partition function to filegroups
Create a new filegroup, or use existing ones, to hold the partitions
Equally balance over LUNs using an optimal layout
83
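The pattern above might be sketched like this for an 8-core box; the object names and the modulo-8 hash are illustrative, not from the lab:

```sql
-- Hash partition function/scheme: one partition per core (8 here)
CREATE PARTITION FUNCTION pf_hash (tinyint)
AS RANGE LEFT FOR VALUES (0, 1, 2, 3, 4, 5, 6);

CREATE PARTITION SCHEME ps_hash
AS PARTITION pf_hash ALL TO ([PRIMARY]);  -- or spread over multiple filegroups

-- Persisted computed hash column spreads inserts over the partitions
ALTER TABLE [Transaction]
ADD HashKey AS CAST(Transaction_ID % 8 AS tinyint) PERSISTED NOT NULL;

-- Cluster on (HashKey, Transaction_ID), placed on the partition scheme
CREATE CLUSTERED INDEX cix_Transaction
ON [Transaction] (HashKey, Transaction_ID)
ON ps_hash (HashKey);
```

With this layout, concurrent inserts of sequential IDs land on eight different last pages instead of one.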

Lab Example: Before Partitioning

[Screenshot: latch waits of approximately 36 ms at a baseline of 99 checks/sec.]

84

Lab Example: After Partitioning*

[Screenshot: latch waits of approximately 0.6 ms at the highest throughput of 249 checks/sec.]

*Other optimizations were applied

85

B-Tree Root Split

[Diagram: during a page split the SH latches taken while traversing down the B-tree are escalated to EX, the virtual root is protected by the ACCESS_METHODS_HOBT_VIRTUAL_ROOT latch, and the leaf pages' Prev/Next pointers are relinked under EX page latches while the row lock is held.]
87

NUMA and What to do

Remember those PAGELATCH waits for UPDATE statements?
Our solution: add more pages
Improvement: get out of the PAGELATCH fast so the next one can work on it
On NUMA systems, going to a foreign memory node is at least 4-10 times more expensive

Use the SysInternals Coreinfo tool to map your NUMA topology

89

How does NUMA work in SQL Server?

The first NUMA node to request a page will own that page
Ownership continues until the page is evicted from the buffer pool
Every other NUMA node that needs that page will have to do foreign memory access

Additional (SQL 2008) feature is the SuperLatch
Useful when a page is read a lot but written rarely
Only kicks in on 32 cores or more
The "this page is latched" information is copied to all NUMA nodes
Acquiring a PAGELATCH_SH then only requires local NUMA access
But: acquiring a PAGELATCH_EX must signal all NUMA nodes
Perfmon object: MSSQL:Latches
Number of SuperLatches
SuperLatch demotions / sec
SuperLatch promotions / sec

See the CSS blog post
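The perfmon counters above can also be pulled from T-SQL; a minimal sketch (the LIKE filters keep it instance-name agnostic):

```sql
SELECT counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Latches%'
  AND counter_name LIKE '%SuperLatch%';
```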


90

Effect of UPDATE on NUMA traffic

[Diagram: app servers issue UPDATE ATM SET LastTransaction_ID ... over arbitrary connections, so a given ATM_ID's page is latched from NUMA nodes 0-3 in turn, generating constant foreign memory access.]

91

Using NUMA affinity

[Diagram: each app server connects to a dedicated port (8000-8003), each port affinitized to one NUMA node, so every ATM_ID's page is only ever touched from a single node.]

How to: Map TCP/IP Ports to NUMA Nodes
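In SQL Server Configuration Manager the mapping is done by appending a node-affinity bitmask in brackets after each TCP port; the ports below are illustrative values for the four-node layout above:

```
TCP Port: 8000[0x1],8001[0x2],8002[0x4],8003[0x8]
```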


92

Final Results and Thoughts

120,000 Batch Requests / sec
100,000 SQL Transactions / sec
50,000 SQL Write Transactions / sec
12,500 Business Transactions / sec
CPU load: 34 CPU cores busy
Given more time, we would get the CPUs to 100%, tune the NICs more, and work on balancing NUMA more.

Speaking of NICs: we only had two, and they were loading two CPUs at 100%

93

Case Study: Online Gaming

Application: Online sports betting, poker and casino play.
Performance:
15 million page views, 980,000 users per day
Over 30 thousand database transactions per second, 500+ million per day
450,000 SQL statements/sec on a single database

Workload Characteristics:
Multiple systems comprise the gaming experience, including payment, casino games, sportsbook, etc.
Require very low latency and must meet high transaction volumes, based on the number of users on the system
Over 100 SQL Server instances and 1,400 databases in the architecture

Hardware/Deployment Configuration:
Scale-up for the payment system: HP Superdome (32-socket, 2-core; 256GB). Investigating x64
Co-operative scale-out for actual gaming activity

Case Study: Online Gaming (cont.)

Other Solution Requirements:
Failure is not an option
Zero data loss and achieved 99.998% availability
Use database mirroring and log shipping across datacenters to achieve HA/DR goals
http://sqlcat.com/whitepapers/archive/2010/06/07/proven-sql-server-architectures-for-high-availability-and-disaster-recovery.aspx
http://sqlcat.com/whitepapers/archive/2010/11/03/failure-is-not-an-option-zero-data-loss-and-high-availability.aspx
Use SQL Server replication for reporting

Observation:
Large scale of users, with low latency requirements
Hot spots on heavily hit tables: page latching
Scale-out helped increase transaction volume (#/sec)

Online Gaming Infrastructure
(no HA, DR & Backup shown)

This is not a WinMo7, it's a SuperDome

[Diagram: SQL Server instance counts per subsystem. Gaming: Betcache 4+, Casino 2+, VS Games 2+, 1x2 Games 12+, CMS 15+, Newsletter 2+, Other 30+. User Account & Sportsbook 8+, Bookmaking 2+, BGI, CSM 2+, Other 40+. Payment 20+ (with replication). ASP.NET Sessions 8+, SMS 4+, Other 20+. DWH Stage 50+, DWH 60+, OLAP 10+, Monitoring 10+, Administration 20+. Internal Office, SharePoint (300+).]

Technical Challenges and Architecting for Extreme OLTP

Challenge: Network
Consideration/Workaround:
CPU bottlenecks for network processing were observed and resolved via network tuning (RSS)
Dedicated networks for backup, replication, etc.
8 network cards for clients

Challenge: Concurrency
Consideration/Workaround:
Latch contention on heavily used tables, last-page insert
The hash partition option caused other problems in query performance and application design
Resolution: Co-operative scale-out

Challenge: Transaction Log
Consideration/Workaround:
Latency on log writes
Resolution: Increased throughput/decreased latency by placing the transaction log on SSDs
Database mirroring overhead very significant on synchronous

Challenge: Database and table design/Schema
Consideration/Workaround:
Latency on IO-intensive data files (including Tempdb)
Resolution: Session state database on SSDs
Resolution: Betting slips/customer databases testing sharding
Single server, single database: 500 tx/sec
Single server, 4 databases: 1,800 tx/sec (sharding)
Multiple servers: 2,600 tx/sec (sharding)

Challenge: Monitoring
Consideration/Workaround:
Security monitoring (PCI and intrusion detection): 10%-25% impact/overhead when monitoring

Challenge: Architecture/Hardware
Consideration/Workaround:
Tests using x64 (8-socket, 8-core) vs. Itanium Superdome (32-socket, dual-core)
Same transaction throughput
IO and backups were a bit slower

Case Study: Financial Stock Market

Application: Real-time, high-transaction, low-latency stock quoting
Performance:
Over 280,000 business transactions/sec
Over 1 million data manipulation calls per second in a single database
Latency per business transaction under 1 millisecond
Real-time nature of data flow

Workload Characteristics:
Send a large batch containing multiple business transactions; parse it and insert all records into a large table which is constantly read
Under 20 tables in the application. Nothing generic about the code/solution

Hardware/Deployment Configuration:
Load distributed based on an alphabetical split
Co-operative scale-out. Commodity hardware (2-socket, quad-core pre-Nehalem) and a high-performance SAN

Case Study: Financial Stock Market (cont.)

Other Solution Requirements:
Mission critical to the business in terms of performance and availability
Require 99.999% uptime overall and 100% during the business day
Treat the system like their mainframe operations
Utilize SQL Server HA features to help support the 5 9s uptime requirement and geographical redundancy
SQL Server Failover Clustering for local (within datacenter) availability
Database Mirroring (High Availability/Async) for geo-availability
Locations around 300 miles apart
30MB/sec log generation with no send queue

Observation:
Extreme low latency and high throughput requirements with machine-born data led to hitting a number of the same bottlenecks we observed more commonly in the scale-up scenarios.

Stock Market Architecture (1)

Stock Market Architecture (2)

Technical Challenges and Architecting for Extreme OLTP

Challenge: Network
Consideration/Workaround:
Network round-trip time for synchronous calls from the client induced latency
Resolution: Batch data into a single large parameter (varchar(8000)) to avoid network roundtrips

Challenge: Concurrency
Consideration/Workaround:
Page latch contention, small table: latching on 36 rows on a single page
Resolution: Pad the rows to spread the latching over multiple pages; Performance Gain: 20%
Page latch contention, large table: concurrent INSERTs into an incremental (identity) column, last-page insert
Resolution: Clustered index on (partition_id, identity) columns; Performance Gain: 30%
Heavy, long-running threads contending for time on the scheduler
Resolution: Map TCP/IP ports to NUMA nodes (http://msdn.microsoft.com/en-us/library/ms345346.aspx); Performance Gain: 20%

Challenge: Transaction Log
Consideration/Workaround:
Log waits
Resolution: Batch business transactions within a single COMMIT to avoid WRITELOG waits
Test of SSDs for the log helped with latency

Challenge: Database and table design/Schema
Consideration/Workaround:
Change decimal datatypes to money, others to int
Integer-based datatypes go through an optimized code path; Performance Gain: 10%
No RI, as this has an overhead on performance. Enforced in the application.

Challenge: Monitoring
Consideration/Workaround:
5% overhead from running the default trace alone
Collect perfmon and targeted DMV/XEvents output to a repository