
EMC World – May 2010

Navisphere Analyzer

Purpose of the script: EMC Navisphere Analyzer allows you to view storage system performance statistics in various types of charts. These charts can help you find and anticipate bottlenecks in the disk storage component of a computer system. Today’s session looks at the Navisphere Analyzer user interface. In particular, it covers the different views available and gives a basic starting point for checking whether your existing configuration is being stressed or is well utilized. This script is designed for use with Navisphere Manager 6.x software.

Please do not alter any of the workstation or CLARiiON Storage System configuration details unless instructed
to do so by these instructions or by a member of the EMC presentation team.

In this session there are primarily two exercises covering archive retrieval, viewing and on-array real time analysis.
In addition to the instructions for these exercises, you will find more exercises and reference material in this handout.
If you have time to do so, please explore those additional sections during this session.

Before you begin:


Fill in the following information:
a. Assigned Array from the Desktop ICON array.txt on your laptop: Array _________ SP ___
b. Storage System IP address to use ___.___.___.___
c. Proceed to exercise A

Exercise A, NAR file viewing offline


This exercise directs you through checking the status of data logging on your array, retrieving an archive file containing performance data, and then looking at that data using Navisphere off-array software. This is primarily a walk-through exercise to build familiarity with the steps involved in performance analysis. The second exercise will cover more details about the metrics you are looking at.
The NAR file is from a test environment where a total of 12 tests were run. You will clearly be able to segment the statistics into 12 areas. The odd numbered tests used a single thread to each LUN and the even numbered tests used 4 threads per LUN. The essence of this exercise is to become familiar with the interface, and to see that increasing load on the array has various effects.
What you do Notes & observations
Start Internet Explorer. <Enter the IP address of your assigned managed node (SP IP address) into the address window of the browser>

The process of getting started with Navisphere Manager is simple. You begin by pointing your browser at the storage system’s IP address.


Login. <Enter the user as emcw and password emcw>

You will be presented with the standard Navisphere view of the Domain you logged into.

<Select Tools -> Analyzer -> Data Logging>

Check that the logger is running and that periodic archiving is set as you want it. If checked, it saves an archive every 5 hours with the default 120 second sampling, or every 2.5 hours if using 60 second sampling.
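The two archive intervals quoted above are consistent with a fixed number of samples per archive. The sketch below assumes that is how the interval is derived; the 150-sample constant is inferred from the 5 hour and 2.5 hour figures in this handout rather than taken from Navisphere documentation, and the function name is purely illustrative.

```python
# Assumption: one archive is written after a fixed number of samples.
# 150 is inferred from this handout (5 h at 120 s, 2.5 h at 60 s sampling).
SAMPLES_PER_ARCHIVE = 150


def archive_interval_hours(sample_period_s):
    """Return the hours between periodic archives for a sampling period."""
    return SAMPLES_PER_ARCHIVE * sample_period_s / 3600


for period_s in (120, 60):
    hours = archive_interval_hours(period_s)
    print(f"{period_s} second sampling: archive every {hours} hours")
```

If the assumption holds, shortening the sampling period gives finer detail at the cost of more frequent, shorter archives.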

If you do not have Analyzer installed on the array, you can still invoke logging, but this will be limited to 7 days of periodic archiving and will create encrypted archives for service use.

<Select Cancel>

If running pre-version 24 array code, the logging feature operated differently, so you will have to manually enable statistics logging at the SP level.

With Rel-24 and above, the logger will automatically enable or disable statistics logging as required. Statistics logging is the process whereby the array collects statistics for each object within the storage system. The logger is required to facilitate collection of those statistics; however, if the logger isn’t running, you can view a subset of statistics using the SP Properties view within Navisphere, or collect some raw statistics using secure CLI commands.


<Select Tools -> Analyzer -> Archive -> Retrieve>

The dialogue will present the current repository contents for the selected SP.

<Scroll down to find the file called emcw_2010_xxx-xx.nar. Select that file and click on Retrieve>

Note the location where the file will be stored.

You will see the status of the operation in the lower pane as the file is uploaded to your workstation.

<Close this dialogue with Done, then close the current browser instance>

The newest file listed could be up to 5.5 hours old, so you may need to use Create New to force the logger to create a new archive containing recent statistical data from its buffer.
You have the option of retrieving archives from SP-A or SP-B. Although they should contain almost identical data, it is worth retrieving from both SPs in case there is a problem viewing one of the files, or in case one SP was rebooted during an archive and is missing samples for that time.


What you do Script


We are now going to use off-array Navisphere to view the archive we retrieved in the previous operation.

We could use the array to view the archive; however, usual practice is to view archives independently of having an array resource available.

Recommended software components for off-array operations to enable viewing Analyzer data.

Ensure the Navisphere Management Server service is running (Start, Settings, Control Panel, Administrative Tools, Services). The service is called NaviGovenor.

Note: this is only required for off-array management and offline Analyzer archive file viewing.

<Start Off-Array UI - double click the desktop ICON labeled OffArrayUI>

If not available, you can explicitly run the off-array management UI by selecting START, RUN and pointing to the following link: "C:\ProgramFiles\EMC\ManagementUI\6.29.x.x\WebContent\start.html"

<Enter Management Server IP address 127.0.0.1 and use default port of 80/443>

127.0.0.1 is the localhost address. Alternatively you can use the IP address assigned by the DHCP server.

Login. <Enter the user as emcw and password emcw>

You will be presented with the standard Navisphere view of the Domain you logged into.



<Select Tools -> Analyzer -> Customize>

Customize is only required once for the off-array environment. Customize is also available for the array environment, so when you set an array option, it will remain set for anyone logging into the array to view real-time data, which is covered in the next exercise.

In the General TAB view, check that the Advanced box is ticked.

In the Archives TAB view, check the default path for archives, check that Performance Survey is the initial view, and check that the Initially Check All Tree Objects box is ticked.

When you have many objects you may wish to be more granular in your selections and choose not to initially check all objects.

In the Survey Charts TAB view, check that Utilization, Response Time and Average Queue Length are selected and that the values shown for each are present.

<Click OK> to use these settings.

Normally we might suggest a threshold of 10 samples, but for this exercise we’ll use 4 for the off-array archive file we’re looking at.

Note: When viewing the Analyzer standard Performance Detail view, there are 4 windows in the view. These are the object (top left), value (bottom left), plot (top right) and plot item list (bottom right). To display the values available for a given object, you must select that object. Selecting a plot item will highlight the plot associated with that selection. This will be useful when selecting many items to view. We’ll see this view later in the exercise.



<Select Tools -> Analyzer -> Archive -> Open>

Open file emcw2010_1_xx-xx.nar. Use the default time points. Select the file in location “C:\Documents and Settings\emcperf”.

Leave the default start and end time for this exercise; however, when doing this on a NAR file from your own storage system, you may want to narrow the time display to make it easier to view specific activity.

<Click OK>

You can merge nar files to cover a longer period of time; however, when opening a large nar file, focusing on a shorter time period can speed up the interface. The merge option is referenced later in this paper.

It is not recommended to merge SP-A and SP-B nar files together, as they should contain very similar data anyway.

You should now have the Performance Survey view if you set up the default open view as shown in prior steps.

<Scroll in this view to see if anything is highlighted by red boxes, indicating that a threshold set in the configuration has been exceeded>

A RED or YELLOW Utilization box may be an indication for concern, especially if the Response Time and/or Queue Length are also RED. Make some notes on what you see in the Survey Chart.

Make a note of any suspected LUNs in the space below:

____________________________



<With the pointer in the Utilization display for LUN 50, either double click, or right click and select Performance Detail. You will now see a graph showing LUN 50 Utilization>

From this display, we want to check a few things, starting with SP Utilization: the two SPs should be about the same if the load is well balanced.

<You need to expand the LUN using the + to see the SP check box – check this to add SP detail>

Uncheck the LUN to increase the scaling of the SP Utilization if necessary. Is the load between the two SPs balanced?

<In the Performance Detail View, right click on the SP and select Properties. Check that under the Cache Tab, Read and Write cache are both enabled, and check the cache page size>

Check the total memory allocated to read and write cache and that both are enabled. A reference for allocation of cache memory can be found in the CLARiiON Best Practices Guide, although the typical recommendation is to reserve up to 20% of total available cache for read cache, and the rest for write cache.

Now we want to check the cache dirty pages – these are pages of data held in memory during writing. If these are very high, it could be an indication of a problem we need to work on.

<Make sure you have the SP selected with the pointer, i.e. SP A is shown highlighted. Now scroll in the lower left window down to the Dirty Pages (%) property – check the box>

Write cache works on a policy of watermarks. Here we see the dirty pages at around 60% to 80%, which indicates that watermark processing is working well. You can see how the watermarks are set by selecting the SP Tab, then right clicking the array and selecting the Performance Overview view.

If the Dirty Pages (%) were consistently high, this may be an issue we’d need to look at more closely. Maybe the watermark settings would need changing? Maybe the cache allocation would need changing? You need to consider the write load on the array and its duration, combined with distribution at the disk level on the backend; after all, the disks govern the speed at which we can de-stage data from dirty pages. Adding more spindles to a particular application can help de-stage data more quickly for write intensive applications.

<Now uncheck both SPs and the Dirty Pages box>



Now let’s look at the LUN details.

<Click once on LUN 50, then in the lower left window, un-check the utilization, then check both the Read Bandwidth and Write Bandwidth boxes.>

Simultaneous reading and writing on a single LUN can be challenging. Let’s look at more detail.

<Now select the property Forced Flushes/s>

If none are seen, that is a good reason to re-check that cache is on, although it can also indicate that write cache is not being worked too hard.

Although we don’t have any forced flushes here, write throughput is the reported write cache hits combined with any forced flushes, i.e. a write that causes page(s) of write cache to be flushed to make room for it does not count in the write cache hit total (unless the write size satisfies the write-aside and bypasses cache; more performance architecture understanding is required if that wasn’t understood).

Let’s just check the LUN properties.

<Right click on LUN 50 and select Properties>

You want to know RAID type, number of disks and user capacity. Also check the stripe Element Size is as expected – 128 is normal for striped raid as per the CLARiiON Best Practices Guide.

Under the Cache TAB you may see the read and write cache enabled boxes empty – this can indicate cache wasn’t enabled for the LUN, but in this instance it was. Always check the current code release notes for known issues with the interface.

You should also check what LUNs share the same disks to see if multiple hot LUNs are due to disk contention on the backend.

<In the Performance Detail view, click on the Storage Pool TAB at the top – then you can expand each RAID Group to see what LUNs share the same disks>

As you will see, some of the suspected RED LUNs shown in the Performance Summary view are sharing disks in the second half of the test.



<Select the LUN TAB at the top of the window>

<Now deselect Forced Flushes/s, Read Bandwidth and Write Bandwidth>

<Select Read Throughput I/O/Sec and Read Cache Hits/s>

None would indicate either random access or reads too big for pre-fetch.

Check the IO size to the LUN.

<Right click on LUN 50 and select IO Size Distribution Summary>

<In the view, you can quickly select all values by right clicking in the left pane and clicking on Select All – Values>

You can see here that all IOs are small (4KB), both read and write. Together with the absence of read cache hits, this indicates a totally random profile.

<Close the IO Distribution Summary window> and <de-select Read Throughput I/O/Sec and Read Cache Hits/s>

<With LUN 50 selected, select the Full Stripe Writes/s check box>

<None of the LUNs are doing FSWs in this archive>

As we saw from the IO Distribution Summary, the writes were small and no Full Stripe Writes are taking place; this suggests the writes are also random, as no full stripe coalescing in cache is taking place. We could still get some write IO coalescing, resulting in larger than 4KB writes at the disk layer; we would need to check write IO size at the disk to validate that.

Another useful view would be the IO Size Distribution Detail.

<Right click LUN 50 and select IO Size Distribution Detail>

<Select Read and Write IO size of 4KB only, as we confirmed that as the only IO size used by checking the IO Size Distribution Summary view previously>

This view is useful to see the read/write ratio over time.



<Close the IO Distribution Detail window. Also, with LUN 50 selected, uncheck the Full Stripe Writes/s box. Also, uncheck the LUN 50 box as well>

Now, expand LUN 50; we can see 5 disk drives.

High disk utilization means we are working the disks well. Low utilization would indicate additional load could be placed on the drives with consistent service and response times.

<Select the last disk check box, then in the parameter window, select Utilization and Average Seek Distance>

What do we see here? Disk seeks are a few GB, indicating a moderate level of randomness. Uncheck Utilization to get a better view of the seek distance (or zoom in).

<Uncheck Utilization and Average Seek Distance>

<Select LUN 50 and then check Write Size>

As you can see here, over time the disk IO size tracks the LUN IO size, again indicating a random workload with little or no coalescing of data. If disk IOs were bigger, that would be a good indication of coalescing taking place. Don’t forget to check you have write cache enabled (LUN and Array).

Some 32KB writes may be seen; these could be protection bits being set rather than coalesced user data.

<Uncheck Write Size and also uncheck LUN 50>

<Check both of the first 2 disks in LUN 50 and then check the Total Throughput for these disks>

The disks at varying times are working very hard. We can see them
reaching over 350 IOPs per disk. Now, for small random IO we have a
rule of thumb (ROT) stating a 15K rpm disk can be used for 180 IOPs
for mixed random load with good response time. When running them
at higher loads we can expect an impact in the response time observed.
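As a quick illustration of the rule of thumb above, the sketch below compares an observed per-disk rate with the 180 IOPs figure and estimates how many disks a given disk-level load would need to stay within it. The function names and the 1400 IOPs example are hypothetical, and the estimate applies to IO measured at the disks, not to host IO, since cache and RAID overhead change what reaches the back end.

```python
import math

ROT_IOPS_15K = 180  # rule of thumb quoted above for a 15K rpm disk


def load_factor(observed_iops, rot_iops=ROT_IOPS_15K):
    """Return how many times the rule of thumb a disk is being driven."""
    return observed_iops / rot_iops


def disks_for_load(disk_iops_total, rot_iops=ROT_IOPS_15K):
    """Return the disk count that keeps each disk at or below the rule of thumb."""
    return math.ceil(disk_iops_total / rot_iops)


print(f"350 IOPs per disk is {load_factor(350):.2f}x the rule of thumb")
# Hypothetical total of 1400 IOPs measured at the disk level.
print(f"1400 disk IOPs suggests about {disks_for_load(1400)} disks")
```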



<Uncheck Total Throughput and also uncheck the second disk drive>

<With the first disk drive highlighted, check the Queue, Average Busy Queue Length and Service Time boxes>

Average Busy Queue compared to the regular Queue can give an indication of burstiness; however, at the disk level, activity includes de-staging writes from cache, which can arrive in bursts.

The disk service time is how long the disk is taking to service each request. If IO gets queued at the disk, the disk response time is then a factor of the service time multiplied by the queue depth. Therefore, the higher the queue, the longer the response time.

A point to remember though is that writes will typically be serviced by cache and have a very fast response time. Reads will have a more directly impacted response time as disk queues increase. Of course, this all depends on IO size, on how writes are being de-staged from cache, and on how efficiently cache is working to optimize that process.

Always check the release notes for known issues relating to accuracy of statistics.
<Uncheck Service Time>

<Uncheck Queue>

<Check Response Time>

You can see the Response Time follows the average busy queue depth
as the service time was pretty stable.
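The relationship described in this step can be sketched as a simple product of service time and queue depth. This is only the approximation stated in the handout, not how Analyzer itself computes Response Time; the function name and the 5 ms service time are hypothetical.

```python
def estimated_response_ms(service_time_ms, queue_depth):
    """Approximate disk response time as service time multiplied by queue depth."""
    return service_time_ms * queue_depth


# Hypothetical samples: a stable 5 ms service time with a growing queue.
for queue_depth in (1, 2, 4, 8):
    response = estimated_response_ms(5.0, queue_depth)
    print(f"queue depth {queue_depth}: about {response} ms")
```

With a stable service time, the estimate rises in step with the queue, which is the pattern the Response Time plot shows against the Average Busy Queue Length.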

<Uncheck all objects>

Now select the SP TAB, click on SP-A and select Total Throughput.

You can see that as the load increases, so does the overall capability of the array, i.e. more threads per test, higher IOPs. More LUNs tested, more IOPs. The second set of tests, where you see the IOPs starting around 270 and increasing steadily to 1286, is where both SPs are being used, so the aggregate IOPs would be higher still.

Those peaks seen at the start of tests are normally writes being absorbed into the protected write cache until watermark processing starts to write data to disks on the back end.



<Uncheck SP A and Total Throughput>

Select the LUN Tab.

Now check LUNs 60 and 61. These are both using the same disks, and you can see the impact when both LUNs are under load as the LUN response time increases.

Select the disks for LUN 60.

This results in fewer aggregate IOs across both SPs due to an increased queue as well as a small increase in seek distance at the drive level.

The other small peaks seen here are associated with disk statistics and the SNiiFER process where there is no host load accessing the disks. This is a process that validates data availability in the background (performing 512KB read operations at 1 IO/s). Take a look at disk read size and you will see.



Summary of Exercise A
We got you to look at LUN 50 as it was used in all 12 tests. In the first test area you see moderate throughput at the disks, but when we increase the threads accessing the same disks, we get much more work from them. The detrimental effect of driving a higher load is an increase in response time to the application due to the increased queuing at the disks (go back and take a look if you have time).

Now, as we add more load to more disks, the same effect can be seen between the single thread tests and the 4 thread tests, i.e. per disk IOPs is higher if we have more processes accessing the LUNs. In all tests where we have a single thread per LUN, the per disk IOPs is low compared to the 4 thread tests. This highlights some key performance notes:

Concurrency, when using small IO sizes, is essential for good performance, i.e. multiple threads/processes.

Also, as we observed in the SP statistics, overall array performance scales with how many LUNs are being accessed concurrently: as more LUNs were busy, the overall throughput increased.

Here we see the distinct 6 areas of the first 6 tests performed on one Storage Processor. Each lower level shows the single thread performance for 1, 2 & 3 LUNs. The higher peaks represent throughput when 4 concurrent threads per LUN are generating IO.

Do not expect to get maximum performance from an array unless you have the necessary disk count to service the load. Please reference the CLARiiON Performance and Availability: Applied Best Practices for guidance on scaling capability for each array type.

When you have finished Exercise-A, close down the off-array browser and proceed to Exercise-B.



Exercise B – Analyzer Statistics Viewing Real-Time

This exercise is to direct you around some of the views in Analyzer while the array is under a
simulated load from a Windows server. You’ll be directed to look at some of the key statistics
that indicate if a system is functioning within acceptable parameters. This exercise is to extend
your experience and expand upon some descriptions of those statistics you are looking at – select
and deselect components and statistics to overlay graphs – but consider the scale of selections
such that high IO/s on the same graph as disk queue length will not be easy to distinguish queue
variation. If you have time you’ll be directed to look at some specific statistics in order to
determine where there is a problem with the current load on the array.

What you do Script


Start Internet Explorer. <Enter the IP address of your assigned managed node (SP IP address) into the address window of the browser>

Repeat as in the first steps in Exercise-A.

Login. <Enter the user as emcw and password emcw>

You will be presented with the standard Navisphere view of the Domain you logged into.

Also refer back to Exercise-A for the customize options that need to be set on the array.

You have already looked at the logging mechanism, so now we want to start viewing real-time statistics.

If only the Local Domain is shown, expand the view by clicking on the + by the Domain icon.

<Right click on your array and move the pointer over the Analyzer selection to see the expanded list of options>

Performance statistics can be viewed for individual components, or you can select the storage system and then view a selection of components. To get this window, you can select the array, SP, raid group, Thin Pool, storage group, LUN, Thin LUN or disk to choose which Analyzer view to look at. Here we’ll select the array to present all objects available.

<Select Performance Survey>

The Performance Survey view will start to plot current statistics based on a 60 second sample period – please wait until you have at least two plots to continue, i.e. wait at least 2 minutes for the plots to show.



The objective here is to watch the real time view develop and start to look at some of the performance statistics that are being logged.

You have to wait for 2 samples to get data plotted. Each plotted value is an average of the statistic over the interval between two samples, except for write cache dirty pages, which is an absolute value at the sample point.

If you set up the survey chart thresholds as instructed earlier, you will start to see green, yellow or red boxes appear. These give you an indication of where to start looking for possible performance issues.

We’ll have a look at how to view some of the key statistics used in analyzing an array’s performance.

Exercise-A gave you familiarity with the interface, and through this exercise we’ll expand upon what some statistics mean.

Utilization – LUNs, SPs and Disks

In the Performance Survey view you can double click on a graph to open the Performance Detail view for that statistic. Then you can select more components to view, which will be placed on the same graph.

Try it – pick one of the utilization graphs in the Performance Survey View and double click on it.

Expand the LUN component to reveal the disks and the storage processor that currently owns this LUN.

SP Properties

Right click on the SP and select Properties. Ensure Cache is allocated and enabled. Total Size indicates the possible maximum – look at the Read Cache Size and Write Cache Size for allocated cache.

The SP Properties view is limited within a nar file compared to the same view when connected to an array in real time, as displayed here.


Dirty Pages

Dirty pages are protected write cache data that hasn’t been committed to disk yet.

To see appropriate value selections available in the lower left part of the detail view, you must select a component item in the upper left part of the view. Dirty Pages will only be an available option when you have clicked on a Storage Processor (SP).

Dirty pages that peak at 99% indicate cache saturation resulting in forced flushing that can hurt performance. We’ll look at LUN forced flushes later in this exercise.

To help when viewing a graph plot, you can click on the legend item in the lower right window pane and it will highlight that statistic in the graph. Also, you can customize the graph views by right-clicking on the graph and selecting Chart Configuration.
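If you copy Dirty Pages (%) values out of a graph into a list, a short check like the one below can pick out the samples that reached the saturation level mentioned above. This is a minimal sketch with a hypothetical function name and made-up sample values; only the 99% figure comes from this handout.

```python
SATURATION_PERCENT = 99  # peak value this handout associates with saturation


def saturated_samples(dirty_page_percent, threshold=SATURATION_PERCENT):
    """Return the indexes of samples at or above the saturation threshold."""
    return [i for i, pct in enumerate(dirty_page_percent) if pct >= threshold]


# Hypothetical Dirty Pages (%) samples for one SP.
samples = [62, 71, 80, 99, 75]
print(saturated_samples(samples))
```

Because Dirty Pages is an absolute value taken at each sample point, a clean result here does not rule out saturation between samples; the Forced Flushes statistic covers that case.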

LUN Bandwidth

Selecting both read and write bandwidth tells us about the load on the LUN; however, you will need to check IO sizes and data locality to determine if the values seen are expected based on the load.

We can check locality by looking at seek distances at the drive level later in this exercise.

Remember to uncheck previously viewed selections to change the graph scale, unless you need to see how one statistic plots against another one.

LUN Forced Flushes

Forced flushes are an indication of write cache saturation – if you have many forced flushes taking place, this will impact the system as seen by increased SP utilization as well as increased response times.

It’s very important to see if any forced flushes are taking place.

Although dirty pages may not have been shown at 100%, that statistic is an absolute value at the sample time. If you are seeing forced flushes taking place, then that indicates the cache pages were 100% dirty at some point.


LUN Read I/O/sec

Looking at the IO/sec and Read Cache Hits/Misses, you can tell if the read pre-fetching is working.

A high ratio of read cache hits per second to LUN Read IO/sec is a good indicator of pre-fetching working. You can directly see this ratio by looking at the Used Prefetches %.

Read hits are when a host read comes in and the data is already in cache. Remember also that pre-fetch activity may span sample periods, i.e. pre-fetched data in one period may not be read until the next period.
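The ratio described in this step is simple to work out by hand from two plotted values. The sketch below shows the arithmetic with a hypothetical function name and made-up numbers; it is not a reimplementation of any Analyzer statistic.

```python
def read_hit_ratio(read_cache_hits_per_s, read_io_per_s):
    """Return the fraction of LUN reads that were served from cache."""
    if read_io_per_s == 0:
        return 0.0  # no reads in this sample, so no ratio to report
    return read_cache_hits_per_s / read_io_per_s


# Hypothetical sample: 450 read cache hits/s against 500 read IO/s.
print(f"read hit ratio: {read_hit_ratio(450, 500):.0%}")
```

Since pre-fetched data may be read in a later sample period, the ratio for a single sample can be misleading; looking across several samples gives a fairer picture.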

LUN IO Size distribution summary

This will enable you to determine where your host IO sizes fit.

Right click LUN-2 in the Detail View, select IO Size Distribution Summary, then in that view, you can select all values by right clicking in the value pane of the window on the left.

Right-click / Select All / Values

In the lower left pane, you can choose to show the I/O rate at each size. The default is I/O count, which means the total I/Os for this sample period.

This is a histogram where each column represents IO in the range from that size to the next size -1 block, e.g. in the view here, we see a value for reads that are 8KB and above, but lower than 16KB in size.
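The column ranges of the histogram can be illustrated with a short sketch. It assumes columns start at power-of-two sizes and that a block is 512 bytes; both are assumptions made for illustration, as is the function name, and any cap on the largest column in the real view is ignored.

```python
BLOCK_BYTES = 512  # assumed block size


def column_range_bytes(io_size_bytes):
    """Return (lowest, highest) IO size in bytes for the column holding this IO."""
    blocks = max(1, io_size_bytes // BLOCK_BYTES)
    # Largest power of two that is less than or equal to the block count.
    lower_blocks = 1 << (blocks.bit_length() - 1)
    lowest = lower_blocks * BLOCK_BYTES
    # The column ends one block short of the next power-of-two size.
    highest = (2 * lower_blocks - 1) * BLOCK_BYTES
    return lowest, highest


# A 12KB read falls in the column that starts at 8KB.
print(column_range_bytes(12 * 1024))
```

Under these assumptions the 8KB column covers 8192 to 15872 bytes, i.e. from 8KB up to one block below 16KB, matching the description above.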

LUN Write Size and Full Stripe Writes

You can view these back in the Performance Detail view to see, over time, if coalesced writes are resulting in full stripe writes to the LUN. This indicates that write cache is working well and some writes are sequential.

If no Full Stripe Writes are seen, writes to this LUN are more random and small, with little or no locality.

Another method to detect sequential write access is comparing disk write IO size with LUN write IO size, i.e. cache coalesces smaller IO into fewer larger IO when de-staging data.

Looking at the average LUN IO size for read or write in the detail view can be misleading, as it will be an average and a low write IO rate will not be accurately shown. You really need to use the IO Distribution Summary for the LUN to see the IO distribution.


Disks – Average Seek Distance & Utilization

This will give an indication of data locality and whether the disk is working hard. Be aware that a disk that shows 100% utilized isn’t necessarily bad, as the sample rate just indicates the disk was never idle, and the value is reported from the highest SP (the other SP may have some more usage, up to 100% also).

You could look at the disk Average Busy Queue Length and compare it with the Queue Length. If the Average is bigger than the reported Queue Length, this may be an indication of bursty activity.

Cached writes may also result in bursty activity at the disk level as write cache flushes data. The trick is to not let that activity lead you to think your host activity is bursty when it isn’t.
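The comparison of the two queue statistics can be reduced to a single ratio. The sketch below is only an illustration with a hypothetical function name and made-up values; the handout gives no threshold, so how large a ratio counts as bursty is left to your judgement.

```python
def burst_ratio(avg_busy_queue_length, queue_length):
    """Return Average Busy Queue Length divided by Queue Length."""
    if queue_length == 0:
        # No average queue reported; any busy queue at all is maximally uneven.
        return float("inf") if avg_busy_queue_length > 0 else 1.0
    return avg_busy_queue_length / queue_length


# Hypothetical disk samples: (Average Busy Queue Length, Queue Length).
for abql, ql in ((3.0, 3.0), (8.0, 2.0)):
    print(f"ABQL {abql}, queue {ql}: ratio {burst_ratio(abql, ql)}")
```

A ratio near 1 means the disk was busy for most of the sample; a much larger ratio means the queue built up only during part of it.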
LUN & Disks write size

This can give an indication that coalescing is taking place in write cache such that disk writes are bigger than the LUN writes.

If we see the LUN and disk write size is the same, typically this implies the writes are very random and not coalescing in cache to become larger IOs – or write cache is not being used.
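One way to put a number on this comparison is the ratio of disk write size to LUN write size. The function name and the sample sizes below are hypothetical, and the sketch does nothing more than the division.

```python
def coalescing_factor(lun_write_kb, disk_write_kb):
    """Return disk write size divided by LUN write size."""
    return disk_write_kb / lun_write_kb


# Hypothetical samples: (LUN write size KB, disk write size KB).
for lun_kb, disk_kb in ((4, 4), (4, 64)):
    factor = coalescing_factor(lun_kb, disk_kb)
    print(f"LUN {lun_kb} KB, disk {disk_kb} KB: factor {factor}")
```

A factor close to 1 matches the random, uncoalesced case described above (or write cache not being used); a factor well above 1 suggests cache is combining writes before de-staging them.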

LUN 50 will show this but LUN 2 doesn’t – can you explain why?
Tip: check the LUN 2 IO Distribution and write IO rate.

CLARiiON cache is great at optimizing back-end disk access, which is of particular benefit to Raid 5 and Raid 6 options that have write penalties associated with small block random write activity.


Performance Overview View

Select the SP TAB in the Performance Detail view, then right click on the array;
<Select Performance Overview>

In the overview view you can see more detailed properties of cache together with 3 key statistics for the overall array: Throughput, Bandwidth and Dirty Pages. One particularly useful detail is the watermark settings, as they are not visible anywhere else when looking at an Analyzer NAR file offline from the array it came from.

The best overview of the cache configuration, and the only place you can see the watermark settings, is the Overview screen. Don’t be fooled by settings that may have been changed during the logging period though, i.e. it may show cache as enabled or disabled here, but you should verify that with read cache statistics and dirty pages in the other views.

The cache states shown here will be those set when the logger started the current nar/naz file logging, so be sure to determine actual settings from measured metrics.

Dirty pages on each SP indicate that write cache is enabled at the array level. Read cache hits and pre-fetch bandwidth are some indicators that read cache is enabled.

Watermark settings are used to intelligently flush write cache pages out to disks on the backend and keep a level of write cache available for bursts of activity.

Raid Group / Thin Pool

Select the Storage Pool TAB in the Performance Detail view. Be careful not to get confused here, as the Raid Group and Thin Pool statistics are derived from disk statistics, not LUN statistics. Thus, the values will depend on all activity for all LUNs within that raid group or all Thin LUNs in a Thin Pool.

Thin Pools aren’t covered in any specific detail here, although disk statistics are logged and can be analyzed in the same way as a regular Raid Group.

This reference was more for information, as we’re not going to be looking at raid group statistics specifically for the exercise. The Storage Pool TAB is the only method to analyze disk activity within a Thin Pool. Thin Pools do have regular LUNs that are considered private and hidden from view, including Analyzer.


The 2 exercises have explored the options and views available to you, with some explanation and
guidance on what the statistics mean and how they help in characterizing your IO.

The following section, should you have time to look at it, will guide you through the loads on a specific set of LUNs sharing the same set of disks.

You have explored the views, now the task is to analyze a specific area where we have an issue.
Now, please explore the interface and look at the following attributes for this load on the array
with a focus on Raid Group 0, LUNs 50 & 51. Look at the following for each of these LUNs and see if
you can draw any conclusions (make notes on the worksheet table at the back of this handout);

LUN read and write throughput
LUN read and write IO size
LUN Response Time
LUN Queue Length
LUN IO sizes
Disk read and write throughput
Disk seek distance
Disk IO sizes
Disk Queue Length
Disk response time

Look at the profiles for these two LUNs. How are they different and what would be a suggestion on
improving performance?

Define the IO profile associated with each LUN and think about what they can be. There is an area
where we do have an issue that we want to fix.

Hint; one of them is doing large sequential reads, which could be a video or data warehousing
application.

Using this hint, think about what helps sequential operations and what could also hurt them.

If it is sequential, are we seeing pre-fetching and a high pre-fetch used rate? We do need to know
what the application is trying to do, then correlate that with characterized IO on the storage
system to see if it is doing what we think it should be doing.

So, what about the other LUNs that showed up red and/or yellow boxes? As explained during the
overview, the red/yellow boxes give an indication of where to investigate and are not indicative of
an absolute problem.

Raid Group 4 has multiple LUNs with different IO characteristics. Check some of the metrics and see
if you can conclude anything about this raid group. Don't spend too much time on this task as the
prior elements focusing on LUNs 50 and 51 are the main points of this exercise. If you have time,
you might want to take some notes on LUNs 2 and 3 statistics for reference (are they busy? Is that
bad?)
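
When filling in the worksheet, the read and write IO size of a LUN can be cross-checked against its bandwidth and throughput, since average IO size is simply bandwidth divided by throughput. The following is a minimal Python sketch of that arithmetic, not part of Navisphere. The function names are hypothetical, the sample numbers are made up rather than taken from the lab NAR file, and 1 MB is assumed to be 1024 KB.

```python
def average_io_size_kb(bandwidth_mb_s: float, throughput_iops: float) -> float:
    """Average IO size in KB implied by bandwidth (MB/s) and throughput (IO/s).

    Assumes 1 MB = 1024 KB. Returns 0.0 when there is no IO to average.
    """
    if throughput_iops <= 0:
        return 0.0
    return bandwidth_mb_s * 1024.0 / throughput_iops


def read_percentage(read_iops: float, write_iops: float) -> float:
    """Share of the IO mix that is reads, as a percentage of total IO/s."""
    total = read_iops + write_iops
    if total <= 0:
        return 0.0
    return 100.0 * read_iops / total


if __name__ == "__main__":
    # Made-up worksheet values for illustration only.
    print(average_io_size_kb(50.0, 200.0))  # 50 MB/s at 200 IO/s
    print(read_percentage(300.0, 100.0))    # 300 read IO/s, 100 write IO/s
```

A large average read size with a high read percentage is consistent with the large sequential read profile in the hint, but IO size alone does not prove sequential access; check pre-fetch and disk seek distance as well.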


Additional reference notes

Although these exercises cover the performance statistics from hosts accessing LUNs in the array,
there may be additional load generated internal to the array. This load could include the
following;

SnapView Snapshot sessions
SnapView clones
MirrorView/S activity
MirrorView/A activity
SAN Copy activity
Raid Group rebuild activity
Hot spare equalize activity
LUN migration operations
Background zeroing for bind operations
Background verifying

Typically, layered application IO will not be logged at the LUN level but will be visible at the
disk level. If you understand what is taking place, once accustomed to the user interface and
operational characteristics of layered applications, you can determine what disk activity relates
to host access or internally generated IO.

Sometimes, you will observe a blip in the statistics, i.e. a value for a statistic outside normal
range. To overcome this being a nuisance, you can either restart the plotting or adjust the scaling
of the graph plots by zooming in or setting the Chart Configuration Axes options in the graph view.

Real-time viewing of Analyzer statistics isn't the preferred method due to the requirement to be
there at the time, as well as the additional impact to the array in presenting the information.
Typically you would look at a captured Analyzer NAR file as covered in Exercise-A.

In the metric selection window, since Release-26 of code, you will see options for Optimal and
Nonoptimal metrics for LUNs. These are used when you have LUNs using the ALUA failover mode
(mode=4) of operation. Typically, you would see Optimal values when accessed from the current
owning SP, and Nonoptimal values when accessed from the non-owning SP, which is a slightly longer
path.

When not running in ALUA mode, selecting either the regular metric or the metric-Optimal will
display the same values.


Supplemental, Command Line NAR file retrieval and export capabilities


This section introduces the ability to script NAR file retrieval for lights-out performance
statistics gathering. As the Navisphere archive file collects data covering the previous 5 hours of
statistics, scripted retrieval of the NAR file is useful when you want statistics for a period of
activity and you're unable to retrieve the file in the normal way using the Navisphere GUI, e.g.
statistics logged on Saturday would need to be retrieved sometime Sunday or they would be
overwritten by Monday. With revision 24 and later revisions of code, the Analyzer Archive facility
allows the automatic archiving of Analyzer files on the array itself for later retrieval via the
GUI or CLI, and these are retained for a much longer period than the previous 5 hours (or 25 hours
for older code archives). Remember though that if Periodic Archiving is not enabled, you will only
grab the prior 5 hours of data by default when you retrieve the archive.

<Ensure the NaviCLI utility is installed. This is easily confirmed if the Navisphere CLI directory
is present. Here you can double click the shortcut on the desktop called NaviCLI>

This will start a command window that will go to the default installation directory
c:\Program Files\EMC\Navisphere CLI

(Username and password will be emcw for the following commands)

<Retrieve the Navisphere archive files using the following command;
"naviseccli -user <username> -password <password> -scope 0 -address <SP IP> analyzer -archive -all">

Be careful with this command as you may have many archives to download when selecting all and it
could take a long time to complete. By omitting the "all" you are presented with a selection list
where you can select one or more archives to retrieve.

Scope will be 1 if the account details used are local and not global.

Do not do this here, but you can reset the statistical data by using the following command if you
are looking to collect data for a specific test period only and you are not interested in
previously collected data;
"naviseccli -user <username> -password <pwd> -scope <0|1> -address <SP IP> analyzer -logging -reset"

The username and password can be omitted if you have set up the security file for NaviSeccli.
The username used here does not have privileges to reset data logging on the arrays being used.

Note; the desktop shortcut used here is not created for you during installation. You have to do
that yourself if you want that shortcut available on your own systems.

Prior to release 24 you would need to use the java archiveretrieve command to get the archive from
the array;
"java -jar archiveretrieve.jar -User <username> -Password <password> -Scope 0 -Address <array IP>
-File archive_emc.nar -Location "C:\program files\emc\Navisphere cli" -Overwrite 1 -Retry 2 -v"
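
For lights-out gathering, the retrieval command above can be assembled by a script rather than typed by hand. The following is a minimal Python sketch that only builds the argument list using the switches shown in this handout; it does not contact an array. The function name and the example address 192.0.2.10 are hypothetical, and leaving out the credentials assumes a NaviSeccli security file has already been set up.

```python
from typing import List, Optional


def build_archive_retrieve_args(
    sp_ip: str,
    username: Optional[str] = None,
    password: Optional[str] = None,
    scope: int = 0,
    retrieve_all: bool = True,
) -> List[str]:
    """Build the naviseccli argument list for 'analyzer -archive'.

    Credentials are added only when both are given; omitting them assumes
    the NaviSeccli security file is in place on the host running the script.
    """
    args = ["naviseccli"]
    if username is not None and password is not None:
        args += ["-user", username, "-password", password, "-scope", str(scope)]
    args += ["-address", sp_ip, "analyzer", "-archive"]
    if retrieve_all:
        # Without -all the CLI presents a selection list, which does not
        # suit an unattended script.
        args.append("-all")
    return args


if __name__ == "__main__":
    # Placeholder address and credentials for illustration only.
    print(" ".join(build_archive_retrieve_args("192.0.2.10", "emcw", "emcw")))
```

On a host with NaviCLI installed, the resulting list could be handed to something like subprocess.run from a scheduled task. Remember the warning above that -all may download many archives and take a long time.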


Now you can follow these steps and open the retrieved NAR file using the on-array or off-array
capability, or you can convert the NAR file data to CSV format for import into Excel.

If Excel format is required you can use the archivedump command to convert the NAR file data to a
format readable by Excel, typically CSV.

You can use the following command to do this;
<naviseccli -user <username> -password <password> -scope 0 -address <SP IP> analyzer -archivedump
-data test.nar -out test.csv -object s,l,d>

You can also filter the output to only get specific statistics like read throughput by adding the
qualifier -format rio

Some more qualifiers for the format command are as follows (separate with a comma if used);

Utilization (%)       u
Response Time (ms)    rt
Dirty Pages (%)       dp

For other qualifiers please consult the Admin Guide.

This example outputs stats for SPs, LUNs and disks (-object s,l,d). To get stats for metaLUNs, etc,
please consult the document; Navisphere Analyzer Administrator's Guide.pdf

If you leave off the qualifier -object, it will output all statistics for all objects.

The Navisphere UI has an Analyzer dump wizard that guides you through device and attribute
selection prior to dumping to a CSV file.
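
The archivedump qualifiers described above combine in a predictable way: object codes and format codes are each joined with commas. The following is a minimal Python sketch that assembles just the analyzer portion of the command from those pieces. The function name is hypothetical, and only qualifier codes that appear in this handout are used in the example; the connection switches (-user, -password, -scope, -address) would still need to precede it.

```python
from typing import List, Sequence


def build_archivedump_args(
    nar_file: str,
    csv_file: str,
    objects: Sequence[str] = (),
    formats: Sequence[str] = (),
) -> List[str]:
    """Build the 'analyzer -archivedump' portion of a naviseccli command.

    objects: object codes such as "s", "l", "d" (SPs, LUNs, disks).
    formats: statistic codes such as "rio", "u", "rt", "dp".
    Leaving objects empty omits -object, which dumps all objects.
    """
    args = ["analyzer", "-archivedump", "-data", nar_file, "-out", csv_file]
    if objects:
        args += ["-object", ",".join(objects)]
    if formats:
        args += ["-format", ",".join(formats)]
    return args


if __name__ == "__main__":
    # SP utilization and dirty pages only.
    print(" ".join(build_archivedump_args("test1.nar", "test1.csv", ["s"], ["u", "dp"])))
```

Keeping the object and format lists as separate arguments makes it easy to loop over several NAR files with the same selection.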

Start Excel, select "open file", browse to the c:\Program Files\EMC\Navisphere CLI directory and
select file type as CSV, then open the test.csv file you created in the last step to view the
statistics as presented in Excel.

If Excel 2007, use the INSERT TAB to display graphing options.

If you're not too familiar with Excel but would like to plot a graph showing one of the statistics
over time, you can easily do this by selecting a column by clicking on the header letter, then once
the column is highlighted, click on the chart wizard icon in the tool bar, select line as chart
type, then click next to see what the chart would look like. You can then customize it as required.

Please note that each device selected in the dump command, like SPs and LUNs, will be listed down
the left column, so selecting an entire column to plot would actually plot all SP stats followed by
all LUN stats, and so on. You would need to be more selective and manipulate the data in order to
plot graphs in a logical manner.
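
Because the dumped CSV lists every object down the same column, one way to get a sensible plot is to split the rows per object before charting. The following is a minimal Python sketch of that step using only the standard library. The column names "Object Name", "Poll Time" and "Utilization (%)" and the sample rows are assumptions made up for illustration; check the header row of your own dump file and adjust the names to match.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List

# Made-up sample in the assumed layout: one row per object per poll time.
SAMPLE_CSV = """Object Name,Poll Time,Utilization (%)
SP A,05/10/2010 10:00:00,35.0
SP A,05/10/2010 10:02:00,40.0
LUN 50,05/10/2010 10:00:00,80.0
LUN 50,05/10/2010 10:02:00,90.0
"""


def series_by_object(csv_text: str, object_column: str, value_column: str) -> Dict[str, List[float]]:
    """Group one statistic into a separate time-ordered list per object."""
    series: Dict[str, List[float]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        series[row[object_column]].append(float(row[value_column]))
    return dict(series)


if __name__ == "__main__":
    grouped = series_by_object(SAMPLE_CSV, "Object Name", "Utilization (%)")
    for name, values in grouped.items():
        print(name, values)
```

Each resulting list can then be plotted as its own line, which avoids the SP values running straight into the LUN values on one chart.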


You can try the archivedump command and be more specific on some qualifiers shown previously.

"naviseccli -user <username> -password <password> -scope 0 -address <SP IP> analyzer -archivedump
-data test1.nar -out test1.csv -object s -format u,dp"

This will output test1.csv containing SP statistics of utilization and write cache dirty pages.

You could also have a go at the dump wizard from the Analyzer drop down in the Navisphere Manager
GUI, using off-array Navisphere.

Another option is archivemerge, used to merge multiple NAR files together. We don't use that in
this session but remember that this is useful if you want to view data access trends that span
more than the typical NAR file size of 5 hours.
It is not necessary to merge NAR files from both SPs as each SP has the same data.

The array based archivedump wizard provides an easy way to dump specific statistics associated with
individual devices rather than using the CLI method shown above.

With either on-array or off-array, select Tools, Analyzer, Archive, Dump

Then select where the source file is located and follow the wizard to select objects to dump and
what statistics you require.


Supplemental, Thin LUN Analysis


This is to highlight some differences in metrics available for Thin LUNs in a CLARiiON
environment and the way in which we view them.
There is a read/write load running to LUN201 on the array. This is a Thin LUN provisioned
from a pool of 3 disks.
Thin Pools in a CLARiiON have a private structure that isn’t visible in the Navisphere
interface. This structure has private LUNs that Thin LUNs utilize in 1GB increments. With the
experience gained from the primary exercises you can take a look at the active Thin LUN and
how to observe IO to both it and the Thin Pool disks.

Check Thin LUN properties.

<Right click on LUN 201 and select Properties>

<Here you can see the Pool properties that this LUN is serviced from and the Thin LUN virtual size
and actual consumed capacity from the pool>

<When selecting the Thin LUN, you do not see cache operations associated with that LUN as these are
associated with the private LUNs servicing the IO to the Pool and those are hidden from view>

These are metrics you will not see when selecting Thin LUNs to analyze. This may change in a future
release, but for now you have to look at the Thin Pool disk characteristics to determine what's
happening in the Pool as a whole.

Regular IO metrics like throughput, bandwidth, and response time are available for each Thin LUN.

<Unlike regular LUNs in the LUN TAB view, you only see the SP a Thin LUN is assigned to. To see the
disks servicing the Thin LUN and its Pool, you have to select the Storage Pool TAB>

<Select the Storage Pool TAB>

<In the view, you can expand the Pool to see the disks servicing the total Pool load. You cannot
see the private LUNs that are hidden in the Pool>

There are no specific instructions on what to investigate here, although if you have time, compare
the disks within the Pool and how those align with the Thin LUN characteristics.

End of exercises


Supplemental notes
The following operations are executed at the disk level to provide data integrity features associated with
redundant RAID types as well as consistency of data stripes that could be at risk due to media issues.

Background zero; Before user data can be written to the physical disks within a LUN, the area has to
undergo a zero operation. New disks are initially supplied in a zero state where data can be written to the
disks immediately after binding LUNs, however if the disks have been used before i.e. bound and
unbound, they have to be re-zeroed.

You can zero the disks using a naviseccli command in readiness for grouping and binding LUNs later
on, or the array will zero the disks when you create new LUNs on them. This zero operation results
in 512KB SCSI write-same commands to the disks in a sequential manner, unless the array has to
zero-on-demand an area the user is writing to that is in the queue but hasn't been zeroed yet.
There is some other small activity on the disks during zeroing as checkpoint operations keep track
of progress.

Typically with no access to the LUNs any zeroing will complete in a matter of a few hours although a
busy array and activity to the disks being zeroed will delay completion. Also, the 512KB write-same
command will not consume back end bandwidth but will affect disk load and utilization.

Background verify; This operation validates consistency of data protection at the disk level and is
automatically performed on newly created LUNs. The IO profile at the disk level is 64KB reads and,
like zeroing, it can take hours to complete and is also governed by array and disk activity.

Background zero, zero-on-demand, and background verify operations exhibit relatively large IO sizes
that can affect one's analysis of the array. Also, if considering user testing, it's worth noting
these operations may affect the performance the array can present due to the parallel action of
user data access and these preliminary operations.

Also be aware these operations run in a sequential manner for any given raid group (RG), e.g. if
you bind 5 LUNs, 0 through 4, on a RG, LUN 0 will start to zero and when complete will perform a
background verify. This is followed by the second LUN in that RG. Each LUN will zero then verify
until all newly created LUNs complete that process. Thereafter the only regular IO you will see at
the disk level due to internal operations will be SNiiFFER, where you will see approximately 1 IO
per second at 512KB in size to each disk in a RG. SNiiFFER is a data checking operation that cycles
through every block in every LUN in the array to ensure data availability, even for data you might
not have touched for months or years. Any data inconsistency detected through SNiiFFER will
automatically invoke recovery and remap of affected blocks. RGs will run through zero, verify and
SNiiFFER operations independently of each other.

Zeroing will have the most effect on performance so consider this when testing. Verify may have a small
effect and SNiiFFER will have a negligible effect on performance.
Always check disk stats to see what IO sizes are taking place at that level. With a RG idle, disk
activity showing 512KB writes indicates zeroing, 64KB reads indicate verifying and 512KB reads
indicate sniffing.
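
The rule of thumb above for an idle RG can be written down as a small lookup. The following is a minimal Python sketch of it, not an EMC tool. The function name and the tolerance of 8KB around each IO size are assumptions chosen for illustration, and the result only means something when there is no host IO to the raid group.

```python
def classify_idle_disk_activity(
    read_size_kb: float,
    write_size_kb: float,
    read_iops: float,
    write_iops: float,
    tolerance_kb: float = 8.0,
) -> str:
    """Guess the internal operation behind disk IO seen on an idle RG.

    512KB writes -> background zeroing
    64KB reads   -> background verify
    512KB reads  -> SNiiFFER
    tolerance_kb is an assumed margin, not a documented value.
    """
    def near(value: float, target: float) -> bool:
        return abs(value - target) <= tolerance_kb

    if write_iops > 0 and near(write_size_kb, 512.0):
        return "background zeroing"
    if read_iops > 0 and near(read_size_kb, 64.0):
        return "background verify"
    if read_iops > 0 and near(read_size_kb, 512.0):
        return "SNiiFFER"
    return "unclassified"


if __name__ == "__main__":
    # Made-up disk statistics for illustration only.
    print(classify_idle_disk_activity(0.0, 512.0, 0.0, 50.0))
    print(classify_idle_disk_activity(64.0, 0.0, 100.0, 0.0))
    print(classify_idle_disk_activity(512.0, 0.0, 1.0, 0.0))
```

Zeroing is checked first because it has the most effect on performance, so it is the one you most want to spot before testing.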
{end}


Worksheet – use as needed during exercises.

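LUN ID                50          51
Owner SP              ________    ________
SP Utilization        ________    ________
LUN Read IOPs         ________    ________
LUN Read size         ________    ________
LUN Write IOPs        ________    ________
LUN Write size        ________    ________
LUN Read MB/s         ________    ________
LUN Write MB/s        ________    ________
LUN response time     ________    ________
LUN Queue             ________    ________
Disk Read IOPs        ________    ________
Disk Read size        ________    ________
Disk Write IOPs       ________    ________
Disk Write size       ________    ________
Disk Queue            ________    ________
Average disk seek     ________    ________
Disk response time    ________    ________

LUN ID                ________    ________
Owner SP              ________    ________
SP Utilization        ________    ________
LUN Read IOPs         ________    ________
LUN Read size         ________    ________
LUN Write IOPs        ________    ________
LUN Write size        ________    ________
LUN Read MB/s         ________    ________
LUN Write MB/s        ________    ________
LUN response time     ________    ________
LUN Queue             ________    ________
Disk Read IOPs        ________    ________
Disk Read size        ________    ________
Disk Write IOPs       ________    ________
Disk Write size       ________    ________
Disk Queue            ________    ________
Average disk seek     ________    ________
Disk response time    ________    ________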
