
Big Data Cases in Banking and Securities

A report from the front lines

Sponsored by

Jennifer Costley
Peter Lankford
30 May 2014


This document was produced by the Securities Technology Analysis Center, LLC (STAC).
Copyright 2014 STAC. All rights reserved. May not be reproduced by any means without
express permission. STAC and all STAC names are trademarks or registered trademarks
of STAC. Other company and product names are trademarks of their respective owners.

About the study sponsor
Today the financial services industry depends on innovation more than
ever to run its business. Intel-based technology for clients, servers,
storage, and networking is the foundation for the new and open
infrastructure required to deliver this innovation. As Intel continues to
deliver value in each new generation of its products, it is important to
understand the business requirements and challenges of the key end
users of its technology. For this reason Intel is proud to sponsor this
research by STAC and to drive thought leadership in the fast-growing
area of Big Data. Silicon technology, complemented by software products and services from Intel,
provides an ideal foundation on which to build solutions for the current and future requirements of
data-intensive computing for financial services.

About STAC
STAC is a technology-research firm that facilitates the STAC
Benchmark Council (www.STACresearch.com/council), an
organization of leading financial institutions and technology
vendors that specifies standard ways to assess technologies used
in finance. The Council is active in an expanding range of big-data, big-compute, and low-latency
workloads. STAC helps user firms relate the performance of new technologies to that of their existing
systems by supplying them with STAC Benchmark reports as well as standards-based STAC Test
Harnesses for rapid execution of STAC Benchmarks in their own labs. Some STAC Benchmark results
from vendor-driven projects are made available to the public, while those in the STAC Vault are
reserved for qualified members of the Council.

About the authors
Jennifer Costley is a scientifically trained technologist with broad multi-disciplinary
experience in enterprise architecture, software development, line
management, and infrastructure operations. Following 31 years of
technology leadership in organizations like Credit Suisse, Bankers Trust, and
DoubleClick, Jennifer now consults to companies, non-profit organizations,
and individuals in areas related to data, governance, and sustainability. This
includes roles with the IEEE and the STAC Benchmark Council. Jennifer has a
PhD in Chemical Physics from Columbia University and a Bachelors in Physics
and Chemistry from Brandeis University.

Peter Lankford is founder and director of STAC, which facilitates the STAC
Benchmark Council. In this role, Peter has helped the finance industry create
benchmark standards in areas such as data-bound time-series analytics,
compute-bound risk simulations, and I/O-bound transformation and
distribution of high-speed data. Prior to STAC, Peter was SVP of the
$240M market data technology business at Reuters and held management
positions at Citibank, First Chicago, and operating system maker IGC. Peter
has an MBA, Masters in International Relations, and Bachelors in Chemistry
from the University of Chicago.

Copyright 2014 STAC Sponsored by




Executive Summary

Investment banking and retail banking often appear near the top of the list of industries investing in
"big data" technology. Yet information about how banks are using that technology is sparse.

STAC recently worked with several of the largest global banks to identify concrete use cases that pose
big data challenges. Our objective was to study cases that were specific to banking rather than cases
such as web log analytics and intrusion detection, where requirements tend to be fairly common across
industries and are better understood. By interviewing staff with direct knowledge of the cases, we were
able to characterize the workloads involved and understand the business problems that arise with
traditional technologies. We were also able to learn about the advantages and challenges of the new
approaches that banks were taking to these workloads. This paper summarizes some of these findings.

The primary purpose of these discussions was to lay the groundwork for technology benchmark
standards that can be applied to big data problems. The STAC Benchmark Council (to which these
banks belong) develops benchmark specifications for technologies used in strategic business functions.
These specifications become a common yardstick by which user firms and vendors can understand the
capabilities of competing solution stacks. We're grateful to Intel (also a member of the Council) for
providing the seed funding to accelerate the benchmark development process in big data. The
production of benchmark specifications covering performance, scaling, resource efficiency, resilience,
security and entitlements, and other key business metrics is now underway. Meanwhile, we decided to
share our learnings outside the Council through this paper, both to raise the level of awareness in the
industry and to attract more participants to the project.

When soliciting cases from the banks, we defined "big data" as a problem, not a technology.
Specifically, we defined a big data workload as one that is too difficult or expensive to handle using
traditional technologies, largely due to data scale or complexity. By focusing on problems rather
than solutions, we did not restrict the investigation to particular technologies and were able to include
some cases that banks had not yet solved. This definition was also flexible enough to cover old
workloads as well as new ones. A workload could be "too difficult or expensive to handle using
traditional technologies" either because the workload is uncommonly cumbersome or because new
technologies render traditional technologies difficult or expensive by comparison, even for existing
problems. As we will describe, we found both kinds of cases.

Our study was qualitative, not quantitative. Because the purpose was to characterize some important
workloads (rather than, say, estimate market sizes), we dove deep into a few cases with select firms
rather than canvassing numerous organizations with high-level questions. If our research into big data
use cases is currently in the "stamp collecting" phase, then this paper describes our first stamp
collection. 1

While the number of cases is small, our long experience in the industry suggests they are worth noting.
The banks we interviewed tend to be technology leaders, often building their own solutions from the
best tools available rather than looking for pre-built applications to buy. The solution patterns they
establish tend to filter into the broader industry over time, both to other banks and to application
vendors.

1 "All science is either physics or stamp collecting." Ernest Rutherford


Looking across these cases, a few themes emerged. Here are the highlights:

- It's not just hype. Investment and retail banks have moved to new technologies for big data
  problems in important business functions, and usage is growing.
- It's not just ETL. The analytic complexity of the workloads we studied ran the gamut, from
  basic transformations to machine learning.
- About half the cases we encountered were about doing new things, while half were about
  doing existing things faster or more cheaply. We're sure there was selection bias against
  reporting "new things," which tend to be competitively sensitive and thus not readily
  discussed.
- Of the famous "three V's" (volume, variety, velocity), the strongest driver was volume. About
  half the cases we encountered involved a petabyte or more of data. Variety took second place.
  While some of the cases involved natural language text (e.g., email, social media), most of the
  content in the workloads we studied was more structured, and none of it was completely
  unstructured (e.g., video, images). Nevertheless, some of the cases involving highly structured
  formats still presented great variety at a semantic level. What is often called "velocity" was a
  driver in only one of our cases. This is partly because the retail banks don't deal with much
  velocity yet, while the investment banks have dealt with high velocity for so long that their
  current solutions for those problems are generally superior to emerging big data technologies.
- Hadoop has pole position in these cases. Despite the technology-agnostic way that we
  solicited cases, Hadoop turned out to be prominent in most of them. Other technologies were
  often involved, but Hadoop was at the center.
- About half the banks have built or are building multi-tenant analytics platforms using new, big
  data technologies. Some are departmental in scope, while some span the entire corporation,
  across retail, commercial, and investment banking. This is an area rich with possibilities and
  challenges.
- Most of the banks felt that expansion of big data technologies into a broader range of use
  cases was constrained by insufficient functionality in critical areas such as entitlements.

This paper describes these themes in more detail. Specifics of the workloads and the proposed big data
benchmark specifications are restricted to members of the STAC Benchmark Council. Membership is
available to both vendors and users of technology. We welcome the involvement of all interested
parties. 2

2 See www.STACresearch.com/bigsig


Background

The STAC Benchmark Council consists of leaders from over 200 financial institutions and 50 vendor
organizations who develop technology benchmark standards based on workloads that are strategically
important to user firms. Examples of Council efforts in recent years include STAC-M3, the benchmark
suite for analytics on large stores of historical market data (the lifeblood of trading), and STAC-A2,
benchmarks based on market risk management (the kind of workloads that drive hundreds of
thousands of processor cores across Wall Street).

With more and more of their engineering challenges falling under the umbrella of "big data", a number
of banks and vendors in the Council formed the STAC Big Data Special Interest Group (SIG) in March
2013. Its aim was to provide a forum to discuss challenges and solutions in the area of big data and
ultimately to develop technology benchmarks for big data workloads.

Few areas cry out for good technology benchmarks more than the Wild West of big data. Solution
designers face dozens of new software and hardware products. They must understand which products
and design patterns are suited to which use cases. And they must determine whether these products
deliver not only the transformative capabilities that they promise, but also the boring-yet-critical
functionality taken for granted in traditional architectures. The need for rigorous big data benchmark
standards is especially strong in finance, where the opportunity to turn information into money is huge,
but the cost, quality, and security constraints grow stronger by the day.

As a founding member of the SIG, Intel heard this cry and graciously provided seed funding to
accelerate the process of defining the workloads, developing the benchmarks, and sharing the
learnings.

The workloads

In the first quarter of 2014, we studied 16 projects at 10 of the top global investment and retail banks,
interviewing individuals with direct knowledge of the workloads. From these interviews, we
documented 10 workloads and two "meta-workloads" that met our big data definition. 3

As explained above, we defined big data as a workload that is too difficult or expensive to handle using
traditional technologies, largely due to data scale or complexity. 4 In most of the cases, the firms were
handling these workloads partly or completely via new technologies. In a few cases, the banks were still
using traditional technologies but were actively seeking new approaches. In one case, a bank described
a workload that might be a good candidate for big data technologies if it weren't for certain limitations
it perceived in those products.

Table 1 (shown as Figure 1 below) briefly summarizes the workloads. The columns to the right of a given
workload indicate the type of banking involved in the interviews in this study. Some workloads that
arose in just one category in our interviews may also apply to the other category or to other forms of
banking not represented in this table. Roughly half the workloads arose in both retail banking and
investment banking contexts (in fact, some of them, such as enterprise credit risk reporting, also involve
other forms of banking, such as commercial banking). One of them was clearly specific to retail (card
fraud detection), while several were specific to investment banking (mostly related to trading). The two
IT-related workloads are not necessarily industry specific, but we included them because of the
centrality of IT to investment and retail banking. We imagine that other IT-intensive, heavily regulated
industries have similar use cases.

3 These do not sum to 16 because multiple interviews concerned some of the same workloads.
4 This definition emerged from the initial meeting of the SIG.


Workloads encountered in this study

Workload R I

Card fraud detection. Using statistical models on credit and debit card
transaction data to detect fraudulent activities while minimizing false positives.

Securities fraud early warning. Detecting potential securities fraud by
institutional clients to support legal requirements for due diligence of
transactions, by analyzing data drawn from hundreds of internal systems.

Enterprise credit risk reporting. Providing a consolidated view of exposure to
different kinds of credit risk across retail and wholesale products, based on loan
ledgers, trading positions, economic history, and other sources.

Tick analytics. Providing simple analytics on time-stamped order book data
collected from real-time exchange feeds.

Social analytics for trading. Using historical social media content to develop
indicator histories (e.g., sentiment) for use in trading algorithms. Creating
indicators from real-time social feeds to enable use of those algorithms.

Trade visibility. Supporting brokerage functions (customer service, regulatory
reporting, surveillance, advisory, prospecting) that need to interrogate deep
histories of customer transactions across all assets and businesses.

Archival of audit trails. Providing storage and retrieval of transactions for
many years, as required by regulations such as FINRA's Order Audit Trail System
(OATS).

Customer data transformation. Transforming inbound data feeds from
custodians and other sources and loading into a centralized data warehouse.

IT policy compliance analytics. Identifying deviations from policy by analyzing
tens of millions of daily event records and config files related to thousands of
regulatory-relevant applications and their associated infrastructure.

IT operations analytics. Using event logs, config files, emails, and utilization
data to improve planning, performance, and resource management and to
identify security risks.

Certain mainframe workloads (a meta-workload). Tasks that use VSAM file
structures that translate easily to key-value formats.

Analytics Platform as a Service (a meta-workload). Providing advanced
analytics capabilities as a service for multiple workloads across multiple lines of
business.

R = Retail banking & wealth management. I = Investment banking & brokerage.

Figure 1


Business drivers

The business forces driving banks to seek big data technologies to handle the workloads above will not
surprise those familiar with the recent focus of the banking industry.

Almost half the projects we encountered were driven by regulatory or legal requirements. The impact
of regulation cuts across all functions within a bank and reaches deep into the organization. Risk
regulation under Basel III (and other regimes) requires more comprehensiveness and accuracy in the
calculation of exposures across asset classes and counterparties. Legal settlements have specific
requirements to improve transaction oversight and avoid the recurrence of sanctions. Even IT
organizations in banks are not exempt: many of their processes fall under the SEC's Regulation SCI, Fed
IT policy requirements, and the ongoing impact of Sarbanes-Oxley (SOX) certification requirements.
Our research showed that such requirements are driving banks to combine data from far more internal
systems than they are accustomed to integrating, as well as to create more powerful analytics than are
readily accessible in their legacy systems.

About a quarter of the projects were driven by a desire to improve agility, in three ways. The first was to
accelerate the response times for analytics. In business functions such as fraud detection and
automated trading, the speed with which a firm can test new ideas (i.e., algorithms) has a direct impact
on how quickly it can react to changing conditions in the external world. (In fact, one firm credited their
big data platform with enabling an arrest of a fraudulent card user on the day following the first
incident!) A second driver was to overcome rigidities in data integration. For example, one bank said
that its RDBMS-oriented process for loading data into a data warehouse required many person-hours
and much calendar time just to incorporate a new attribute that appeared in source data. Finally, some
banks sought agility through increased automation of process bottlenecks. In one case, automating
SOX compliance analysis enabled the bank to prioritize among numerous potential remediation areas
by performing what-if analyses of their impacts.

The remaining quarter of the projects were driven primarily by the pursuit of IT cost savings. Cost
containment continues to be a key strategic imperative for banks, not only to boost efficiency but also
to conserve capital. The amount of a bank's capital directly affects how much risk it can take on, which
in turn affects its potential revenue. More stringent capital adequacy requirements from the regulations
above have only made that equation more difficult. Banks told us that through deployment of new
architectures to handle workloads in this study, they have reduced costs for software, hardware, and
labor, and they intend to extend those savings aggressively. For example, firms cited the license costs
of relational databases, data warehouse appliances, and analytic databases as a key target for
reduction. A significant number were also using distributed file systems and low-cost disk storage to
replace traditional SANs and tape backups. And some firms were on a clear path to offload workloads
from their mainframes, in order to reduce cost allocations for both MIPS and storage.

Irrespective of their initial business driver, nearly all of the projects enjoyed IT cost savings. We
frequently heard banks say: "The project paid for itself quickly." This is an example of the virtuous circle
that attends many disruptive technologies. As illustrated in Figure 2, a business unit that adopts big
data technology for one reason soon finds itself able to realize additional benefits. In several cases that
we studied, no matter where the firm started on the circle, it found itself able to move in either
direction once it had achieved its original goals. For example, a bank that used big data technology in
order to revolutionize the comprehensiveness of its IT policy compliance analysis to meet expanding
regulatory requirements also reduced costs significantly compared to its previous process.


Virtuous circle: business drivers and side benefits

Expand capability: meet new business needs; implement smarter algorithms; include new types of content.
Reduce costs: reduce software license costs; move to low-cost infrastructure; automate to reduce headcount.
Increase agility: respond at the speed of business; integrate data more easily; remove process bottlenecks.

Figure 2

Nature of the workloads

Figure 3 plots the workloads we studied against three dimensions that matter to a solution designer:

1. The horizontal axis represents the "structuredness" of the data involved. In casual conversations
   (and the tech media) this is often presented as a Boolean variable: content is either structured or
   it's not. But the reality is more complex. For example, many people consider the body of an email
   to be unstructured, since it has no rules for what goes where. However, some banks think of email
   messages as structured data because they are readily parsed into identifiable concepts that can be
   searched and otherwise operated upon. In the end, most banks view "structuredness" as a
   spectrum that ranges from highly structured content like transaction records (which tend to have
   predetermined layouts and clearly defined field values) to highly unstructured content such as
   video (where pretty much the only element of structure is time).
2. The vertical axis corresponds to "analytic complexity", which ranges from simple transformations
   at the low end of the scale to machine learning at the high end. While vague, analytic complexity
   signals the sophistication of the personnel and tools required. The more complex cases require
   data scientists and advanced statistical tools, while the less complex are readily handled by
   administrators wielding scripts. Analytic complexity is also a proxy for the data-processing
   intensity of the workload, that is, how much it exercises compute and I/O. However, it's an
   imperfect proxy, since the extent to which a given workload is bottlenecked on processors,
   networks, or storage will depend a great deal on the architecture that carries it. For example,
   traversing a graph is efficient in a well-tuned graph database but can be horribly burdensome to a
   relational database.
3. The thickness of each ellipse crudely represents the size of the data involved in the workload. Data
   sets less than a petabyte have thin borders, while those of a petabyte or greater have heavy
   borders. Note that a petabyte is an arbitrary dividing line. Less than a petabyte of data might still
   be considered quite high volume.

Nature of the workloads encountered

[Figure 3: workloads drawn as ellipses on two axes.
Horizontal axis: data types, from more structure to less structure. Examples of the scale:
transactions, market data, account information; email, log files, digital docs; blogs, tweets;
doc scans, voice; photos, video.
Vertical axis: analytic complexity, from lower to higher. Examples of the scale: basic ETL, search,
relational joins, graph mining, machine learning.
Workloads plotted: card fraud detection, social analytics for trading, securities fraud early warning,
IT operations analytics, basic tick analytics, trade visibility, IT policy compliance analytics,
enterprise credit risk reporting, client data transformation, archival for audit.]

Figure 3

It is evident from the chart that about half of the workloads were a petabyte or greater. One can also
note that although there was a range of "structuredness" in the content covered by these workloads, it
was biased toward the structured end and was utterly unrepresented in the completely unstructured
end (for example, none of our cases involved images or video). In fact, the diagram doesn't entirely
show this bias. Although roughly half of the workloads required integration of data from different
systems (i.e., spanning the silos), those data were fairly structured. Nevertheless, the data sometimes
varied greatly at the semantic level (e.g., different departments using the same field for a different
purpose or the same purpose but using a different vocabulary, etc.). This sort of variability was a key
challenge for integration in some cases.
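
To make the semantic-variety problem concrete, here is a minimal Python sketch of one way to reconcile structured records from two departments that use different field names and vocabularies for the same concepts. Everything in it is hypothetical: the department names, field names, codes, and the `normalize` helper are illustrative assumptions, not details reported by the banks in this study.

```python
# Hypothetical mappings: each department names the same concepts differently.
CANONICAL_FIELDS = {
    "retail": {"cust_id": "customer_id", "prod": "product_type"},
    "brokerage": {"client_no": "customer_id", "instrument_class": "product_type"},
}

# Hypothetical vocabularies: different codes that mean the same thing.
CANONICAL_VALUES = {
    "product_type": {
        "MTG": "mortgage",
        "mortgage loan": "mortgage",
        "EQ": "equity",
        "stock": "equity",
    },
}


def normalize(record, source):
    """Map one department's record onto canonical field names and values."""
    field_map = CANONICAL_FIELDS[source]
    normalized = {}
    for field, value in record.items():
        # Unknown fields pass through unchanged rather than being dropped.
        canonical_field = field_map.get(field, field)
        value_map = CANONICAL_VALUES.get(canonical_field, {})
        normalized[canonical_field] = value_map.get(value, value)
    return normalized


retail_row = normalize({"cust_id": "C1", "prod": "MTG"}, "retail")
brokerage_row = normalize({"client_no": "C1", "instrument_class": "stock"}, "brokerage")
print(retail_row)
print(brokerage_row)
```

In practice the hard part is building and maintaining the mapping tables, not applying them, which is consistent with the integration difficulty described above.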

In terms of analytic complexity, our cases were about evenly split between the relatively simple and the
relatively complex. Significantly, a key driver of the projects concerning complex analytic workloads
was improving the quality of those analytics. This sometimes meant analyzing an entire data set instead
of statistical samples. In other cases, it was about supplementing traditional data with new kinds of
content (e.g., email, social media). And sometimes the key to quality was accelerating the analytics.
Decreasing the time to answer enabled more iterations, which in turn led to better models.


Framing these in the Three V's

It's worth noting that Figure 3 implicitly captures two of the famous "Three V's" popularized by IT
analysts and vendors (volume, variety, and velocity). 5 As explained, the thickness of an ellipse indicates
volume. And its width indicates variety in the data structures. So what about the third V, velocity? Why
doesn't it feature in our conceptual map? Simply because it only arose in one case.

In big data circles, the velocity challenge is variously defined as:

1. Keeping up with a high rate of data flowing into the system. That is, not losing data.
2. Quickly making data available for queries as it streams in. That is, not making users wait until
   tomorrow to query what happened 6 minutes ago.
3. Performing event-driven ("streaming") analytics. That is, enabling algorithms to respond in real
   time to incoming data. (Note that this "push" model is different from ensuring up-to-date data for
   the "pull" model of velocity type 2.)
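
The distinction between the three definitions can be sketched in a few lines of Python. This is a toy, single-process illustration under stated assumptions: the `EventStore` class, its method names, and the event fields are all hypothetical, and a real system would add persistence, back-pressure, and concurrency control.

```python
from collections import deque


class EventStore:
    """Toy in-memory store illustrating the three velocity definitions."""

    def __init__(self):
        self._events = deque()    # type 1: every ingested event is retained
        self._subscribers = []    # type 3: callbacks invoked as data arrives

    def subscribe(self, callback):
        """Register a function to be pushed each event as it is ingested."""
        self._subscribers.append(callback)

    def ingest(self, event):
        # Type 1: keep up with the inbound flow without dropping data.
        self._events.append(event)
        # Type 3: push the event to event-driven ("streaming") analytics.
        for callback in self._subscribers:
            callback(event)

    def query(self, predicate):
        # Type 2: pull model; an event is queryable as soon as ingest() returns.
        return [event for event in self._events if predicate(event)]


store = EventStore()
alerts = []
store.subscribe(lambda e: alerts.append(e) if e["score"] > 0.9 else None)

store.ingest({"id": 1, "score": 0.2})
store.ingest({"id": 2, "score": 0.95})

print(len(store.query(lambda e: True)))  # both events are retained and queryable
print([e["id"] for e in alerts])         # only the high-score event triggered the callback
```

The callback path (type 3) and the query path (type 2) read the same data but differ in who initiates the work: the arriving event or the user.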

Just one of our cases was driven to new technology in part by a concern with velocity: social analytics
for trading. The event-driven pattern in this workload (as opposed to its historical analysis pattern) is all
about maximizing the speed of semantic analysis on incoming textual data (velocity type 3). Too much
latency between a market-moving tweet and an action by a trading algorithm can turn profit to loss.

Aside from this single case, none of the cases offered up by the banks were driven by a concern with
velocity. While some of the banks forecast an eventual requirement to handle high velocity in fraud or
customer service scenarios, they did not have that requirement today. And although it's true that any
solution for basic tick analytics is sensitive to velocity type 1 (often ingesting data from the world's
busiest exchanges at millions of updates per second), keeping up with ingest rates is not what
motivated banks in this study to use new technologies for tick data. Traditional "tick databases" are
more than capable of handling today's ingest rates. The motivation was to reduce cost.

In our estimation, the tick data example illustrates the general reason why data velocity did not surface
as a driver of big data technology adoption in investment banks. For many years, these firms have dealt
with higher message rates than most firms in other industries, so their "traditional technologies" are
not challenged. 6 They have built their own software or purchased highly tuned applications and
middleware from specialist vendors. Big data software from Silicon Valley, as innovative as it is, does
not increase capability along this dimension at this time. 7

"Certain mainframe workloads"

The workloads in Figure 3 are defined in terms of their inputs, outputs, and functions, irrespective of
the technology used to support them. Table 1, however, includes one entry that is defined in terms of
the technology used to support it: "Certain mainframe workloads". While this category no doubt
overlaps with other workloads in the list (Enterprise Credit Risk Reporting was one example we
encountered), we gave it its own designation because the customers who brought it up seemed fairly
energetic about moving these workloads to Hadoop.

5 The earliest reference we have found was from the Meta Group in 2001. See:
http://blogs.gartner.com/douglaney/files/2012/01/ad9493DDataManagementControllingDataVolumeVelocityandVariety.pdf
6 This underscores the contextual nature of our definition of big data. While an electrical utility contemplating
how to deal with an influx of 100,000 smart meter updates per second is likely to consider that a big data
problem, a bank accustomed to consuming and acting on 6 million updates per second in the equities and
options market is not.
7 This is not to say that big data technologies will never enter the high-velocity niches in finance. As these
technologies mature in other industries that have velocity issues (perhaps those dealing with the "Internet of
Things"), banks could start to adopt them if they reduce costs or offer some other compelling benefits.

The data sets in these workloads may be of modest size (hundreds of TB), but the amount of processing
tends to be large. What they have in common is that they utilize certain VSAM file structures that easily
translate to Hadoop key-value format.
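
As a rough illustration of why keyed record files map naturally onto key-value pairs, the Python sketch below splits a fixed-width record into a key and a dictionary of remaining fields. The record layout, field names, and `record_to_key_value` helper are hypothetical assumptions for illustration only. It also assumes records have already been decoded to plain text, whereas real mainframe data is typically EBCDIC-encoded and may contain packed-decimal fields that need separate conversion.

```python
# Hypothetical fixed-width layout: (field name, start offset, length).
LAYOUT = [
    ("account_id", 0, 10),
    ("branch", 10, 4),
    ("balance_cents", 14, 12),
]
KEY_FIELD = "account_id"  # assumed to play the role of the record's primary key


def record_to_key_value(line):
    """Split one fixed-width text record into a (key, value-dict) pair."""
    fields = {
        name: line[start:start + length].strip()
        for name, start, length in LAYOUT
    }
    key = fields.pop(KEY_FIELD)
    return key, fields


sample = "0000012345" + "NY01" + "000000150000"
key, value = record_to_key_value(sample)
print(key, value)
```

Each resulting pair could then be written to whatever key-value container the target platform uses. That choice is outside the scope of this sketch.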

The banks said that mainframe resources are expensive relative to big data alternatives. Part of this is
the cost of mainframe processing time and storage, which is due both to the acquisition cost of the
machinery and software and to the costs of operating the system. The other part was the cost (and risk)
of relying on COBOL programming: finding good talent for this legacy skill set is getting harder.

Nature of the solutions

While the main point of this project was to learn about big data workloads (which are independent of
the technology used to support them at any point in time), we also asked interviewees about the ways
they had solved these big data challenges, if indeed they had. It turns out that about two thirds of the
workloads in Table 1 were being handled in production by big data technologies at the time of our
interviews.

Hadoop played a central role in most deployments and was also the focus for those cases where the
solution had not yet been set. Other technologies were often involved (e.g., NoSQL databases), but
Hadoop was at the center. The major exceptions were a tick analytics deployment that used a
document database and an IT policy analytics deployment based on a graph database. Apache Spark
and other distributed, in-memory technologies also appeared to be gaining mindshare with
interviewees (perhaps market share), particularly in time-sensitive areas like social analytics for trading.
On the front end of the solution stack, several banks mentioned R and SAS on Hadoop as important
components, while Python and a range of visualization tools also came into play.

The role of corporate IT

Because a great deal of big data software is open source, it is relatively easy for any technologist within
a bank to download a product and experiment with it. By lowering barriers to entry, this makes it
tempting for business units to build their own big data solutions without relying on central IT groups.
Not surprisingly, several of the deployments we encountered were at a business-unit level.
Nevertheless, several of them involved a shared infrastructure managed by corporate IT.

Figure 4 offers a way of thinking about the role of corporate IT departments in terms of the service
layers of any analytics proposition. In about half the cases we encountered, corporate IT served as a
center of excellence, advising line-of-business (LOB) technologists on which technologies to use and
how to set them up, or had taken on responsibility for managing a number of LOB-specific clusters
(these are the first two columns in the diagram).

In the other half, the banks had built or were in the process of building a new centralized platform to
provide analytic services to several lines of business, which we dubbed an "analytics platform as a
service". In some cases, these platforms had a corporate-wide mandate across retail, commercial, and
investment banking. The firms said that the benefits of this model included cost efficiency and the
potential to exploit shared data. However, few of them, if any, had built shared data pools yet (i.e., an
"enterprise data hub"), despite a long-run ambition to do so. While data from different departments
might be brought together, the resulting data stores were dedicated to a particular business purpose
(e.g., using commercial bank information for the benefit of the retail bank). They did not constitute a
so-called "data lake", from which any line of business could drink. Nor had any bank achieved end-to-end
"analytics as a service", though some had an ambition to provide the necessary analytics expertise.

Roles played by corporate IT in these cases

[Figure 4: matrix of service layers (rows: analytics expertise, data, infrastructure, operations,
technical expertise) against corporate IT roles (columns: center of excellence, support center,
analytics platform as a service, enterprise data hub, analytics as a service). Each cell is marked as
handled by the line of business or handled by IT. The center of excellence and support center columns
are labeled ~50% of cases, and the analytics platform as a service column is labeled ~50% of cases.]

Figure 4

Analytics platform as a service

Centralized analytics are not new. Many firms have enterprise data warehouses and BI tools that serve
multiple businesses. What is new is the kind of analytics involved, an increased desire to leverage data
across traditionally siloed businesses, and a desire to shift the IT cost curve radically.

We considered centralized analytics a "meta-workload": i.e., a workload consisting of multiple
underlying workloads from multiple lines of business. That is, centralized analytics platforms are
"multi-tenant." Managing multi-tenant workloads introduces technical challenges beyond those imposed by
each workload on its own, which makes this case worthy of its own discussion.

The sub-workloads on these platforms spanned both interactive analytics and batch jobs. They
included several of the workloads profiled elsewhere in this study, plus many others, such as:

- Credit card prospecting and marketing
- Consumer credit risk analysis (incorporating commercial bank information)
- Marketing equities research to institutional investors
- Anti-money laundering
- General ETL offload for multiple use cases. (This does not really fit the definition of
  analytics but leverages the same infrastructure.)


Despite the variety of workloads, the new analytics platforms we studied had a few things in common.
The petabytes of data they housed were skewed toward the structured end of the spectrum, just as we
saw for the individual workloads in Figure 1. The users of the platform were internal to the bank (that
is, direct use of the service had not yet been extended to customers). And the core of the platform was
a Hadoop cluster, from tens to low hundreds of nodes. This was usually integrated with an existing data
warehouse, and often supplemented with NoSQL databases and a variety of analytic tools, including
existing BI tools.

The hurdles that they faced along this road were common to any major implementation of a shared
service:

• Disaster recovery and failover planning. This includes addressing cross-border regulations, as
  well as differentiating critical golden-source data, which has high availability requirements,
  from exploratory data whose loss could be tolerated.
• Transition and migration costs (cost of hiring and developing staff).
• Governance: Establishing rules for who gets what resources when.
• Providing quick response times to accommodate new workloads, and offering multiple
  configuration options to meet workload requirements.

In one interview, an internal user unfavorably compared the flexibility and agility of the firm's
centralized Hadoop service to that of a particular cloud-based Hadoop provider. While business groups
were mostly prohibited from using such outsourced alternatives (which is another story), those cloud
services provided an external benchmark for cost and service levels that challenged internal IT.

Key Challenges / Gaps in Big Data Technology

As a final point of inquiry, we asked interviewees about key challenges facing big data technologies in
their institutions, either for the use case in question or with respect to expanding the technologies to
additional use cases. A few things came up more than once:

Security and entitlements. Carefully and reliably controlling who can access what data is
fundamental to financial services, from investment banking and brokerage to retail banking. In
some cases, inadequate support for access control is holding back adoption of big data
products. It almost certainly precludes the creation of cross-business data lakes. The required
functionality includes high-performance encryption of data at rest and in flight, as well as
fine-grained entitlements control that can be managed via standard corporate systems.
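
To make "fine-grained entitlements" concrete, the following is a minimal Python sketch of column-level access control with deny-by-default behavior. It is an illustration only, not a description of any product or of any interviewed bank's system. The class names, group names, dataset name, and column names are all hypothetical, and the in-memory store stands in for whatever corporate directory or entitlement system a firm would actually use.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entitlement:
    """Grants one directory group read access to named columns of one dataset."""
    group: str
    dataset: str
    columns: frozenset


@dataclass
class EntitlementStore:
    """Hypothetical in-memory policy store (assumption: a real one would be
    populated from the firm's standard corporate entitlement systems)."""
    grants: list = field(default_factory=list)

    def allowed_columns(self, user_groups, dataset):
        # Union of the columns granted to any of the user's groups for this dataset.
        allowed = set()
        for grant in self.grants:
            if grant.dataset == dataset and grant.group in user_groups:
                allowed |= grant.columns
        return allowed


def read_row(store, user_groups, dataset, row):
    """Return only the fields the caller is entitled to see; deny by default."""
    allowed = store.allowed_columns(user_groups, dataset)
    if not allowed:
        raise PermissionError(f"no entitlement for dataset {dataset!r}")
    return {key: value for key, value in row.items() if key in allowed}


# Illustrative policy: a retail risk group may see two columns of a commercial dataset.
store = EntitlementStore([
    Entitlement("retail-risk", "commercial_accounts",
                frozenset({"account_id", "balance"})),
])
row = {"account_id": 1, "balance": 100.0, "tax_id": "hidden"}
visible = read_row(store, {"retail-risk"}, "commercial_accounts", row)
```

In this sketch a user whose groups carry no grant for a dataset gets a `PermissionError` rather than an empty row, so a missing policy cannot silently be read as "nothing to hide".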

Multi-tenancy. Multi-tenancy is the practice of operating multiple independent applications in
a shared environment where they are competing for underlying resources, e.g. operation
within a Hadoop cluster. Under these conditions, the ability to share resources and data across
multiple businesses while maintaining the required level of service to each is a key concern.
Poorly behaved applications can be "noisy neighbors," consuming so much resource that they
degrade the performance of other apps in the cluster.
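
One common way to contain a noisy neighbor is to cap each tenant at a fair share of the cluster and redistribute whatever the lighter tenants leave unused. The Python sketch below implements weighted max-min fairness as an illustration of that idea. It is not the algorithm of any particular scheduler, and the tenant names, demands, and capacity units are hypothetical.

```python
def fair_share(capacity, demands, weights=None):
    """Weighted max-min fair allocation of `capacity` across tenants.

    demands: dict of tenant -> requested units (e.g. containers or cores).
    weights: optional dict of tenant -> relative priority (default 1.0 each).
    """
    weights = weights or {tenant: 1.0 for tenant in demands}
    alloc = {tenant: 0.0 for tenant in demands}
    active = {tenant for tenant, demand in demands.items() if demand > 0}
    remaining = float(capacity)

    while active and remaining > 1e-9:
        total_weight = sum(weights[t] for t in active)
        # Tenants whose whole demand fits inside their weighted share this round.
        satisfied = {t for t in active
                     if demands[t] <= remaining * weights[t] / total_weight}
        if not satisfied:
            # Everyone left wants more than their share: split what remains by weight.
            for t in active:
                alloc[t] = remaining * weights[t] / total_weight
            break
        for t in satisfied:
            alloc[t] = demands[t]
            remaining -= demands[t]
        active -= satisfied
    return alloc


# Hypothetical tenants: the ad hoc job asks for five times the whole cluster.
shares = fair_share(100, {"aml": 20, "risk": 50, "adhoc": 500})
```

With these inputs the small tenant receives its full request, and the two larger tenants split the remainder equally, so the oversized ad hoc request cannot starve the others.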

Product and vendor maturity. Most of the firms we interviewed accepted the high rate of
change in open source big data products as a cost of doing business. And they viewed the
relatively small size of the related vendors as a risk worth taking. However, some of them felt
these issues limited the footprint of big data technologies in their organizations. In particular,
product changes that require modifications to applications can be problematic in highly
regulated sectors, where application owners must demonstrate that modified application logic
continues to satisfy regulations.

Interop limitations. Most of the big data solutions in our study required integration with
existing systems. The banks seemed fairly happy with interoperability on the back end but
sometimes felt that front-end interop was limited (e.g., popular BI tools suffered reduced
functionality on Hadoop). Others expressed dissatisfaction with the level of performance they
could achieve when integrating multiple big data infrastructures (e.g., Hadoop and NoSQL
databases). Each delivered great performance on its own; but when combined, the whole was
less than the sum of the parts.

Conclusion

It bears repeating that we do not view this set of use cases as representative of all big data use cases in
retail and investment banking. On the contrary, we think they only scratch the surface, since we came
across many additional cases that we did not have time to profile. However, even our limited sample
points to some fairly robust conclusions. Clearly there are many workloads that banks feel are too
difficult or expensive to handle using traditional technologies due to scale or complexity. The potential
of new technologies to expand capability, improve agility, and reduce costs in such cases is substantial.
But these technologies must overcome specific hurdles in order to broaden their acceptability to banks.

In the interplay that will ensue between customer needs and product evolution, we believe that
independent, technology-agnostic benchmark standards will be an important catalyst. They both
accelerate technology selection at user firms and shorten the sales cycle for vendors. By demonstrating
the strengths and weaknesses of any technology stack, such standards also help ensure that
deployments are successful and that products succeed in the market based on their merits. And by
providing vendors with a set of relevant workloads for product developers and a rallying point for
product marketers, multi-customer benchmarks propel the entire industry forward at a faster rate.

We invite both technology users and technology providers to join the STAC Benchmark Council's
project to produce these standards for big data. 8

8 www.STACresearch.com/bigsig
