
Splunk Live

October 2010



Who we are
Established financial services technology consulting company
Founded in 2004 by experts in risk management technology
Exclusive focus on Capital Markets
Engaged at top-tier international banks and hedge funds
Offices in NY, London, Bangalore

Broad product, functional and technology expertise

Expertise translates into common solution patterns which are reused to client benefit
Products: Credit, Rates, Commodities, FX
Process: Trade Capture, Valuation, on demand / end of day Valuation / Risk, Enterprise Market / Credit Risk

FpML: A Common Language for Financial Communication

Our Approach
We aim for better, generalized solutions to problem patterns

page 2

Presentation Agenda
The Enterprise IT Problem
Challenges of Enterprise Systems
Splunk Solutions for the whole Software Development Lifecycle
  Cross-cutting concerns
  Design
  Release Cycle
  Operations


page 3

Towers of Hanoi or Tower of Babel?

page 4

The algorithm

Strategic Success

Effective Communication Splunk

Clear Message

Common Language

page 5

The architecture

Robust System

Transparent Conversations Splunk

Message Driven

Common Format

page 6

Unified Operational Intelligence with Splunk

Capital Markets systems:

Expensive
Complex
Large operational and support teams
Maintenance/support lags development initiatives
Costly downtime

Preventive is better than Corrective
Corrective Maintenance: quick and replicable

page 7

EXAMPLE: Fictional Trading System Diagram

page 8

Operational Patterns in Large Systems
How do we apply behavior across functional components?
Cross-cutting concerns
  Apply to all parts, regardless of function
  At application level, often handled via Aspect Oriented Programming:
    Logging
    Performance Profiling
    Security
    Transactionality
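At the application level, a cross-cutting concern such as logging or performance profiling can be attached to many functions without editing their bodies. The following is a minimal sketch using a Python decorator in place of a full Aspect Oriented Programming framework. The names `traced`, `handle_novation` and `save_trade` are hypothetical, chosen to echo the Novation Handler and Trade DAO boxes in the application-level diagram.

```python
import functools
import logging
import time

logger = logging.getLogger("crosscut")


def traced(func):
    """Wrap func so every call is logged and timed, without touching its body."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        logger.info("enter %s", func.__qualname__)
        try:
            return func(*args, **kwargs)
        except Exception:
            # Log the stack trace once, then let the caller handle the error.
            logger.exception("error in %s", func.__qualname__)
            raise
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            logger.info("exit %s after %.2f ms", func.__qualname__, elapsed_ms)
    return wrapper


# Hypothetical component functions; only the decorator line is added to each.
@traced
def handle_novation(trade_id):
    return f"novated {trade_id}"


@traced
def save_trade(trade_id):
    if not trade_id:
        raise ValueError("missing trade id")
    return True
```

The same wrapper applies to any function regardless of what it does, which is the sense in which the concern cuts across functional components.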

But what about at higher levels?
This is how the operations team experiences the system

page 9

Cross cutting at the APPLICATION Level

<class> <class> <class>

Novation Handler

Trade DAO

Message Listener


page 10

Cross cutting at the SYSTEM Level

Client Trade Processing External Gateway

Log Aggregation

page 11

Cross cutting at the ORGANIZATION Level

Trading System
Market Data System
Valuation System

Operational Intelligence

page 12

Design Problem: The Design Paradox

Modular and Distributed are great for design and development
  increased productivity
  improved flexibility

They make a system look fragmented to the operational teams. Borders are problematic

An issue occurs within one of the components
This leads to an incident across the border
The symptoms are observed in a different place at a different time

Aggregate all logs and cross-index them
Create an integrated dashboard
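To make "aggregate and cross-index" concrete, here is a small sketch in plain Python. It is not Splunk's implementation; it only illustrates the idea of merging per-component logs into one timeline and building lookups by field. The `Event` fields and the sample data are assumptions made for illustration.

```python
import heapq
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    ts: float        # event time in seconds (assumed comparable across components)
    component: str
    trade_id: str
    text: str


def aggregate(*streams):
    """Merge per-component streams, each already sorted by time, into one timeline."""
    return list(heapq.merge(*streams, key=lambda e: e.ts))


def cross_index(events, fields=("component", "trade_id")):
    """Build one lookup per field: field name -> field value -> list of events."""
    index = {f: defaultdict(list) for f in fields}
    for ev in events:
        for f in fields:
            index[f][getattr(ev, f)].append(ev)
    return index


# Made-up sample logs from two components.
gateway = [
    Event(1.0, "gateway", "T1", "received"),
    Event(4.0, "gateway", "T2", "received"),
]
processing = [
    Event(2.0, "processing", "T1", "validated"),
    Event(3.0, "processing", "T1", "booking failed"),
]

timeline = aggregate(gateway, processing)
index = cross_index(timeline)
```

With the index in place, `index["trade_id"]["T1"]` returns every event for that trade in time order, whichever component logged it.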

page 13


See issues by:

functional area
component
support classification
etc.

page 14


Track a problem message across all components
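Tracking one message across components amounts to filtering the combined logs by a shared identifier and ordering the hits. The sketch below assumes a hypothetical log line layout (`timestamp component LEVEL msgId=... text`); real logs will differ, and the sample lines are invented.

```python
import re

# Hypothetical line layout: "<ISO timestamp> <component> <LEVEL> msgId=<id> <text>"
LINE = re.compile(
    r"^(?P<ts>\S+) (?P<component>\S+) (?P<level>[A-Z]+) "
    r"msgId=(?P<msg_id>\S+) (?P<text>.*)$"
)


def trace(lines, msg_id):
    """Return (components visited in order, first ERROR record or None)."""
    hits = []
    for line in lines:
        m = LINE.match(line)
        if m and m.group("msg_id") == msg_id:
            hits.append(m.groupdict())
    # Same-format ISO-8601 timestamps sort correctly as strings.
    hits.sort(key=lambda h: h["ts"])
    path = []
    for h in hits:
        if not path or path[-1] != h["component"]:
            path.append(h["component"])
    first_error = next((h for h in hits if h["level"] == "ERROR"), None)
    return path, first_error


sample = [
    "2010-10-05T13:00:01 gateway INFO msgId=M42 received FpML message",
    "2010-10-05T13:00:02 processing INFO msgId=M42 validating",
    "2010-10-05T13:00:02 gateway INFO msgId=M43 received FpML message",
    "2010-10-05T13:00:03 processing ERROR msgId=M42 booking rejected",
    "2010-10-05T13:00:04 client WARN msgId=M42 no acknowledgement",
]

path, first_error = trace(sample, "M42")
```

Here `path` shows the route the message took, and `first_error` points at the component where it first went wrong, which may not be the component where the symptom was noticed.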

page 15

Release Cycle Problem: The Problem Only Occurs in Production (good acronym)
Tests passed
For some reason we only see the problem once the system is live

Exception occurred in QA/UA, but tests passed and no one saw it
Same problem blew up in Production later

Solution with Splunk

Tag & Categorize events
  Ignorable
  Known (and have recipe for recovery)
  New
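The three categories above can be sketched as a simple rule table: anything that matches no rule is treated as new and therefore worth a look, even in QA. This is an illustration in plain Python rather than Splunk configuration, and the patterns and recovery recipes are invented examples.

```python
import re
from collections import Counter

# (category, pattern, recovery recipe). All entries are hypothetical examples.
RULES = [
    ("ignorable", re.compile(r"heartbeat timeout", re.I), None),
    ("known", re.compile(r"connection reset", re.I), "restart the gateway adapter"),
]


def classify(message):
    """Return (category, recipe); unmatched messages are 'new'."""
    for category, pattern, recipe in RULES:
        if pattern.search(message):
            return category, recipe
    return "new", None


messages = [
    "WARN heartbeat timeout on feed A",
    "ERROR Connection reset by peer",
    "ERROR NullPointerException in TradeDAO",
]
summary = Counter(classify(m)[0] for m in messages)
```

Counting events per category after each QA run makes a previously unseen exception stand out even when every test passed.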

Link to everything:
  Knowledge Base (e.g. Support Wiki)
  Source Control viewer (FishEye)
  Build Server (TeamCity/Hudson)
  Bug Database (e.g. Jira)

page 16

Root Cause

Show problem FpML message via REST
Drill through to Support Wiki for solution

page 17

Operations Problem: The Non Sequitur

Lack of context makes investigation very expensive
Collaboration frequently means long conference calls

"We have a problem. Can you look at it?"
Collaborative effort preceding call is lost
Inability to correlate events across components and over time
Inability to look historically
  When did the problem appear first?
  Did we just introduce it in this release?
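Both questions can be answered mechanically once historical events are searchable. The sketch below finds the earliest event matching an error signature and maps its timestamp to the release that was live at the time. The release dates, version labels and events are invented for illustration.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical release history: (go-live time, version), sorted by time.
RELEASES = [
    (datetime(2010, 9, 1), "1.0"),
    (datetime(2010, 9, 20), "1.1"),
    (datetime(2010, 10, 4), "1.2"),
]


def release_at(ts, releases=RELEASES):
    """Return the version that was live at ts, or None if ts predates all releases."""
    starts = [start for start, _ in releases]
    i = bisect_right(starts, ts) - 1
    return releases[i][1] if i >= 0 else None


def first_seen(events, signature):
    """Return (timestamp, version) of the earliest event containing signature."""
    matches = [ts for ts, text in events if signature in text]
    if not matches:
        return None
    ts = min(matches)
    return ts, release_at(ts)


events = [
    (datetime(2010, 10, 5, 13, 0), "NullPointerException in TradeDAO"),
    (datetime(2010, 9, 10, 9, 0), "sync ok"),
    (datetime(2010, 9, 25, 10, 0), "NullPointerException in TradeDAO"),
]

result = first_seen(events, "NullPointerException")
```

In this made-up data the exception first appears under release 1.1, so the latest release (1.2) did not introduce it.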

Just email a Splunk link
Single entry point for ALL INTELLIGENCE on this problem
It can be passed around with no loss

page 18

Support Email: Sync was slow starting 1pm. Any ideas?

Useless without Splunk; legitimate with it

See trends over time, across releases
Confirm, drill down, resolve
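A report like "sync was slow starting 1pm" can be confirmed or refuted by bucketing measured durations by hour and comparing against a baseline. The sketch below does this in plain Python. The sample durations, the baseline hours and the threshold factor of 2 are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def hourly_mean(samples):
    """samples: iterable of (hour_of_day, duration_seconds) -> {hour: mean duration}."""
    buckets = defaultdict(list)
    for hour, seconds in samples:
        buckets[hour].append(seconds)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


def first_slow_hour(samples, baseline_hours, factor=2.0):
    """Return the first non-baseline hour whose mean exceeds factor x the baseline mean."""
    means = hourly_mean(samples)
    baseline = mean([means[h] for h in baseline_hours])
    for hour, value in means.items():
        if hour not in baseline_hours and value > factor * baseline:
            return hour
    return None


# Invented sync durations in seconds, keyed by hour of day.
samples = [
    (10, 2.0), (10, 2.2), (11, 2.1), (12, 2.0),
    (13, 6.5), (13, 7.0), (14, 6.8),
]

slow_from = first_slow_hour(samples, baseline_hours={10, 11, 12})
```

With these numbers the slowdown is confirmed as starting in the 13:00 bucket, which gives the investigation a concrete starting point for drilling down.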

page 19

Good Design takes into account the whole lifecycle of a System

You will be remembered for the failures

Volume, speed, etc.
You CAN have it both ways: clarity does not have to hinder performance
Splunk helps

The challenge is Clear Communication. The requirements are

Design for transparency

Optimize for people, not machines. Hardware is cheaper
Design for the end user
Design for the operations team
State should be human readable

Design for scalability

Make it faster by adding hardware, not by compromising transparency
Make it faster only after it works and is transparent

A system chain is only as strong as the weakest link

Splunk unifies it all

page 20