
Splunk Live

October 2010

altin.papa@riskfocusinc.com
vassil.avramov@riskfocusinc.com

Who we are
Established financial services technology consulting company
Founded in 2004 by experts in risk management technology
Exclusive focus on Capital Markets
Engaged at top-tier international banks and hedge funds
Offices in NY, London, Bangalore
www.riskfocusinc.com

Broad product, functional and technology expertise


Expertise translates into common solution patterns which are reused to client benefit
Products: Credit, Rates, Commodities, FX
Process: Trade Capture, Valuation, on demand / end of day Valuation / Risk, Enterprise Market / Credit Risk

FpML: A Common Language for Financial Communication

Our Approach
We aim for better, generalized solutions to problem patterns

page 2

Presentation Agenda
The Enterprise IT Problem
Challenges of Enterprise Systems
Splunk Solutions for the whole Software Development Lifecycle
  Cross-cutting concerns
  Design
  Release Cycle
  Operations

Recommendations

page 3

Towers of Hanoi or Tower of Babel?

page 4

The algorithm

Strategic Success

Effective Communication Splunk

Clear Message

Common Language
Reactive

page 5

The architecture

Robust System

Transparent Conversations Splunk

Message Driven

Common Format
Reactive

page 6

Unified Operational Intelligence with Splunk

Capital Markets systems:


Expensive
Complex
Large operational and support teams
Maintenance/support lags development initiatives
Costly downtime

Maintenance:
Preventive is better than Corrective
Corrective maintenance should be quick and replicable

page 7

EXAMPLE: Fictional Trading System Diagram

page 8

Operational Patterns in Large Systems
How do we apply behavior across functional components?
Cross-cutting concerns
  Apply to all parts, regardless of function
  At application level, often handled via Aspect Oriented Programming:
    Logging
    Performance Profiling
    Security
    Transactionality

But what about at higher levels?
This is how the operations team experiences the system

page 9

Cross cutting at the APPLICATION Level


<class> <class> <class>

Novation Handler

Trade DAO

Message Listener

Logging
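
As a rough illustration of logging as an application-level cross-cutting concern, the sketch below uses a Python decorator in the role of the logging aspect. The class names come from the slide; the `logged` decorator, the method names and the log format are assumptions made for this example and do not describe the presenters' actual system.

```python
import functools
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(name)s %(message)s")


def logged(func):
    """Hypothetical logging aspect: log entry, exit and failure of a call."""
    log = logging.getLogger(func.__qualname__)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.info("enter")
        try:
            result = func(*args, **kwargs)
        except Exception:
            log.exception("failed")
            raise
        log.info("exit")
        return result

    return wrapper


# The three classes share no code, yet all get the same logging behavior.
class NovationHandler:
    @logged
    def handle(self, trade_id):  # method name is illustrative
        return f"novated {trade_id}"


class TradeDAO:
    @logged
    def save(self, trade_id):  # method name is illustrative
        return trade_id


class MessageListener:
    @logged
    def on_message(self, payload):  # method name is illustrative
        return len(payload)
```

The business methods contain no logging statements of their own; the concern is applied uniformly from the outside, which is the property the following slides try to recover at the system and organization levels.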

page 10

Cross cutting at the SYSTEM Level

Client Trade Processing External Gateway

Log Aggregation

page 11

Cross cutting at the ORGANIZATION Level

Trading System, Market Data System, Valuation System

Operational Intelligence

page 12

Design Problem: The Design Paradox


Modular and distributed architectures are great for design and development
  increased productivity
  improved flexibility

But they make a system look fragmented to the operational teams
  Borders are problematic

Example
An issue occurs within one of the components
This leads to an incident across the border
The symptoms are observed in a different place at a different time

Solution
Aggregate all logs and cross-index them
Create an integrated dashboard
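
To make "aggregate and cross-index" concrete, here is a minimal sketch in plain Python. It is not Splunk's implementation; the `Event` fields, component names and sample lines are all invented for illustration, and it assumes each component's log is already ordered by time.

```python
import heapq
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    timestamp: str   # ISO-8601 text, so string order equals time order
    component: str
    message_id: str
    text: str


def aggregate(*streams):
    """Merge per-component logs (each already time-ordered) into one timeline."""
    return list(heapq.merge(*streams, key=lambda e: e.timestamp))


def cross_index(events):
    """Index the merged timeline by message id and by component."""
    by_message, by_component = defaultdict(list), defaultdict(list)
    for event in events:
        by_message[event.message_id].append(event)
        by_component[event.component].append(event)
    return by_message, by_component


# Made-up sample data for two components.
gateway_log = [
    Event("2010-10-01T13:00:01", "gateway", "M1", "received"),
    Event("2010-10-01T13:00:05", "gateway", "M2", "received"),
]
processing_log = [
    Event("2010-10-01T13:00:03", "processing", "M1", "validation failed"),
]

timeline = aggregate(gateway_log, processing_log)
by_message, by_component = cross_index(timeline)
```

Looking up `by_message["M1"]` returns the events for one message in time order across both components, which is the kind of view needed when the symptom shows up in a different place from the cause.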

page 13

Dashboard

See issues by:


functional area
component
support classification
etc.

page 14

Conversation

Track a problem message across all components

page 15

Release Cycle Problem: The Problem Only Occurs in Production (good acronym)
Tests passed
For some reason we only see the problem once the system is live

Example
Exception occurred in QA/UA, but tests passed and no one saw it
Same problem blew up in Production later

Solution with Splunk


Tag & Categorize events
Ignorable
Known (and have recipe for recovery)
New
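
The three categories can be sketched as a small rule table. This is plain Python for illustration only, not Splunk's tagging mechanism; the patterns and recovery notes are hypothetical examples.

```python
import re

# Hypothetical rules: (pattern, category, recovery note). First match wins.
RULES = [
    (re.compile(r"heartbeat timeout", re.I), "ignorable", None),
    (re.compile(r"duplicate trade id", re.I), "known",
     "replay the message after de-duplication"),
]


def categorize(line):
    """Return (category, recovery note) for one log line."""
    for pattern, category, recipe in RULES:
        if pattern.search(line):
            return category, recipe
    # Anything no rule recognizes is treated as a new, unreviewed event.
    return "new", None
```

With rules like these applied in QA, an exception that matches nothing surfaces as "new" even when the tests pass, so someone sees it before the release reaches Production.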

Link to everything:
Knowledge Base (e.g. Support Wiki)
Source Control viewer (FishEye)
Build Server (TeamCity/Hudson)
Bug Database (e.g. Jira)

page 16

Root Cause

Show the problem FpML message via REST
Drill through to Support Wiki for solution

page 17

Operations Problem: The Non Sequitur


Lack of context makes investigation very expensive
Collaboration frequently means long conference calls

Example
"We have a problem. Can you look at it?"
Collaborative effort preceding the call is lost
Inability to correlate events across components and over time
Inability to look historically
  When did the problem appear first?
  Did we just introduce it in this release?

Solution
Just email a Splunk link
Single entry point for ALL INTELLIGENCE on this problem
It can be passed around with no loss

page 18

Performance
Support Email: Sync was slow starting 1pm. Any ideas?

Useless without Splunk; legitimate with it

See trends over time, across releases
Confirm, drill down, resolve
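
One way to check a report like "sync was slow starting 1pm" is to bucket timings by release and hour and compare the averages. The sketch below does this in plain Python; the sample numbers, release labels and the 5-second threshold are invented for illustration, and in practice the same grouping would be expressed as a search over indexed logs.

```python
from collections import defaultdict
from statistics import mean

# Made-up samples: (release, hour of day, sync duration in seconds).
samples = [
    ("1.4", 12, 2.0), ("1.4", 13, 2.2),
    ("1.5", 12, 2.1), ("1.5", 13, 9.5), ("1.5", 13, 10.5),
]


def mean_by_release_and_hour(rows):
    """Average duration for each (release, hour) bucket."""
    buckets = defaultdict(list)
    for release, hour, seconds in rows:
        buckets[(release, hour)].append(seconds)
    return {key: mean(values) for key, values in buckets.items()}


trend = mean_by_release_and_hour(samples)

# Flag buckets whose average exceeds an arbitrary 5-second threshold.
slow = {key for key, avg in trend.items() if avg > 5.0}
```

In this made-up data only the 1pm bucket of release 1.5 is flagged, which both confirms the email and points to where the slowdown was introduced.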

page 19

Recommendations
Good Design takes into account the whole lifecycle of a System

You will be remembered for the failures


Volume, speed, etc.
You CAN have it both ways: clarity does not have to hinder performance
Splunk helps

The challenge is Clear Communication. The requirements are:

Design for transparency


Optimize for people not machines. Hardware is cheaper
Design for the end user
Design for the operations team
State should be human readable

Design for scalability


Make it faster by adding hardware not by compromising transparency
Make it faster only after it works and is transparent

A system chain is only as strong as the weakest link


Splunk unifies it all

page 20