
Creating an Alternative Java Platform at PayPal – Zing JVM


Agenda

1. Motivation, Goals
2. Zing Features (The Promise)

3. Preparing the Platform (The Reality)

4. Evaluation, Results

5. Potential Benefits

6. Next Steps

7. Credits, References

©2018 PayPal Inc. Confidential and proprietary.


Motivation, Goals
Motivation

• Many Risk services


• Are “Site-Facing” / “Real-Time” / “Customer-Impacting”
• Must process requests within a given latency SLA (e.g., 850 ms, 150 ms, ...)
• Are very data intensive (xx MB / request)
• Are implemented in Java (Helix, Raptor) => Garbage Collection!
• Historically have required careful GC tuning

• GC pauses impact latency, cause timeouts => Risk ATB misses


• GC tuning cycle is repetitive and sensitive to workload (endless rinse, repeat…)
• GC algorithms change (CMS, G1, ...)
• Risk algorithms (biz logic, data requirements) change
• Application complexity, data requirements only increase over time
• Latency SLA never increases over time



Motivation – example service-config.xml



Risk Platform – Target Pools

Risk Compute Layer – Risk Models

• Raptor 3.x, some Docker, production
– VM: 16 vCPU, 100 GB RAM, 1000+ nodes
– RUCS application, 3 pools
– RTCS – risktxncomputeserv, Docker
– RGCS – riskgenericcomputeserv
– RUCS – riskunifiedcomputeserv

Risk Decision Layer – Risk Rules


• Raptor 3.x, Docker, pre-production (audit mode)
– RUPS – riskunifiedpaymentserv (8 vCPU, 32 GB RAM)
– RUDAES – riskudauthevalserv (4 vCPU, 16 GB RAM)
• Helix, production
– RPDS – riskplanningdecisionserv
– RAES – riskauthevalserv



Goals

• Can we improve latency?


• Can we maintain latency under higher load (increase efficiency)?
• Can we simplify / eliminate GC tuning?

• If alternative Java platform satisfies above requirements:


• What does it take to implement / maintain at PayPal production scale?
• => Prove in production
• => Evaluate at scale

• Collect data, evaluate / estimate potential benefits


• Input for (potential) commercial negotiation



Zing Features (The Promise)
Zing Overview

• Zing: highly scalable JVM, 100% compatible with Java 6, 7, and 8 (Java 11 support coming soon), optimized for Linux and x86
• Eliminates GC pauses as an issue!
• C4 (Continuously Concurrent Compacting Collector)
– No stop-the-world GC pauses
– Supports heap sizes from 1 GB to 8 TB
– No/minimal tuning
• Zero-overhead, always-on runtime diagnostic tool
• ReadyNow! technology for dynamic & reusable compiler optimizations
• Falcon JIT compiler: flexibility for continued performance optimizations by Azul Engineering

Azul Systems Confidential


Zing C4 Collector

• Concurrent guaranteed-single-pass marker


– Oblivious to mutation rate
– Concurrent ref (weak, soft, final) processing
• Concurrent Compactor
– Objects moved without stopping mutator
– References remapped without stopping mutator
– Can relocate entire generation (New, Old) in every GC cycle
• Concurrent, compacting old generation
• Concurrent, compacting new generation
• No stop-the-world fallback
– Always compacts, and always does so concurrently



Zing Memory Allocation



Typical Java GC tuning

• java -Xmx12g -XX:MaxPermSize=64M -XX:PermSize=32M -XX:MaxNewSize=2g
-XX:NewSize=1g -XX:SurvivorRatio=128 -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC -XX:MaxTenuringThreshold=0
-XX:CMSInitiatingOccupancyFraction=60 -XX:+CMSParallelRemarkEnabled
-XX:+UseCMSInitiatingOccupancyOnly -XX:ParallelGCThreads=12
-XX:LargePageSizeInBytes=256m …
• java -Xms8g -Xmx8g -Xmn2g -XX:PermSize=64M -XX:MaxPermSize=256M
-XX:-OmitStackTraceInFastThrow -XX:SurvivorRatio=2 -XX:-UseAdaptiveSizePolicy
-XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled
-XX:+CMSParallelRemarkEnabled -XX:+CMSParallelSurvivorRemarkEnabled
-XX:CMSMaxAbortablePrecleanTime=10000 -XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=63 -XX:+UseParNewGC -Xnoclassgc …



Zing GC tuning

• java -Xmx40g





Zing ReadyNow

• Eliminate unnecessary deoptimization


• Aggressive class loading – load classes early, but do not initialize them
• Implementation of Compiler API
• Yesterday’s optimizations for today’s runs



Preparing the Platform (The Reality)
Creating a new platform

• NDA
• Vendor onboarding
• Evaluation license negotiation
• InfoSec review
• Infrastructure: RHEL? Ubuntu 14? Ubuntu 16? Docker?
• Platform: Helix? Raptor?
• Installation process?
• Build & deployment process?



Zing deployment architecture – Raptor & Docker

• ZST (memory mgmt module, tools) – on host OS/VM (!)


• ZST/docker, Zing JVM (JDK), license file – on Docker base image
• Additional libs/JARs – on Docker base image (*)



Zing deployment architecture – Raptor & Docker

• ZST pre-installed on latest VM production patch (Ubuntu 14.04, kernel 4.4.0-133)


• https://github.paypal.com/PaaS/hostutils/wiki/paas-base-ubuntu-20180827
• New VM instances in production and staging provision latest patch
• Existing production systems undergo patching cycles

• Zing Docker base image, build pipeline customization provided by Raptor team
• https://github.paypal.com/FrameworksRaptor-R/zingJVM
• Easy to use, simple application build customization (!)

• Customized build pipeline combines regular application build with Zing Docker base image
• Produces application Docker image with Raptor application & Zing JVM
• Deploys (only) on VMs with ZST
• (Genesis stages not on 4.4.0-133 yet)



Raptor build pipeline customization

• In the Maven release pipeline, add Post-build Actions - Additional Parameters:
• -Dassembler.dockerTemplatesUrl=https://paypalcentral.es.paypalcorp.com/nexus/content/repositories/plugins/com/paypal/paas/docker/raptor/docker-zing/exp-jan-2019-1.0.1/docker-zing-exp-jan-2019-1.0.1.zip
• -Dassembler.deployScriptsUrl=https://paypalcentral.es.paypalcorp.com/nexus/content/repositories/snapshots/com/ebayinc/platform/devex/Deployer/PP300_34-SNAPSHOT/Deployer-PP300_34-20180827.154447-25-perl.zip



What about JSSE?

• PayPal Infra JSSE checks JVM vendor (used to be incompatible?)


• https://github.paypal.com/PlatformSecurity-R/infra-jsse-sslsession-cache-provider/blob/release/src/main/java/com/paypal/infra/ssl/PayPalJSSEProperties.java#L115

• Works



Evaluation, Results
RTCS latency, peak hours



RTCS CPU, peak hours



RTCS latency, peak hours

• RTCS VM: 16 vCPU, 100GB RAM


• Test: 24GB heap (same as Oracle), no tuning
• Benefit grows with increasing percentile: ~188 ms (~34%) faster at P99 in DCG13
• Maintains latency as throughput increases (compare SLCA low traffic vs DCG13 high traffic)

Time window: 9/28/18 11am - 12pm

Build labels:
  A = riskunifiedcomputeserv-3.112.20180927195426870_rtcs_20180927195426870,stack-raptor,3.6.2;***;unset
  B = riskunifiedcomputeserv-3.112.20180928013924623_rtcs_20180928013924623,stack-raptor,3.6.2;***;unset

Build  Colo   Count    Failure Count  Failure %  Min (ms)  Max (ms)  Avg (ms)  Median (ms)  StdDev    P95 (ms)  P99 (ms)
A      dcg13  515020   4              0.000777   0         1736      227.1939  206.9298     87.27528  400.8821  545.2358
B      dcg13  4262     0              0          36        553       186.0652  175.3258     56.13245  304.1082  357.6881
A      slca   39483    1              0.002533   12        1894      149.1375  137.5333     77.56034  292.4554  367.5646
B      slca   590      0              0          12        387       139.0932  129.58       68.57365  266.9125  343.11

Difference (A - B)  Max       Avg       P95       P99
dcg13 diff msec:    1183      41.12863  96.77393  187.5477
dcg13 diff %:       68.14516  18.10288  24.14025  34.39753
slca diff msec:     1507      10.04423  25.54286  24.45456
slca diff %:        79.56705  6.734882  8.733935  6.653133
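The "diff msec" and "diff %" rows can be reproduced from the two dcg13 rows above. The following is a minimal sketch, assuming the difference is taken as the first (high-count) row minus the second row, and that the percentage is relative to the first row; the function and variable names are illustrative and not taken from any PayPal tooling.

```python
def diff_row(baseline, candidate):
    """Return (diff in ms, diff in % of baseline) for one latency metric."""
    diff_msec = baseline - candidate
    diff_pct = 100.0 * diff_msec / baseline
    return diff_msec, diff_pct


# dcg13 rows from the table above: Max, Avg, P95, P99 (all in ms)
baseline = {"max": 1736, "avg": 227.1939, "p95": 400.8821, "p99": 545.2358}
candidate = {"max": 553, "avg": 186.0652, "p95": 304.1082, "p99": 357.6881}

# Compute one (diff msec, diff %) pair per metric
diffs = {name: diff_row(baseline[name], candidate[name]) for name in baseline}

for name, (msec, pct) in diffs.items():
    print(f"{name}: diff msec = {msec:.4f}, diff % = {pct:.4f}")
```

The same calculation applies to the slca rows and to the later 2x and 3x traffic tables; a negative value means the second row is slower on that metric.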



RTCS TPM, peak hours, Zing 2x traffic



RTCS CPU, peak hours, Zing 2x traffic



RTCS latency, peak hours, Zing 2x traffic

• RTCS VM: 16 vCPU, 100GB RAM


• Test: 24GB heap (same as Oracle), no tuning
• Maintains latency as throughput increases (2x traffic test). High percentiles are better than Oracle at 1x traffic
• Also tested older Zing C2 JIT compiler vs newest Falcon JIT (default)
Time window: 10/11/18 11:00 - 13:00
Note: Zing boxes run 2x traffic

Build labels:
  A = riskunifiedcomputeserv-3.113.4_rtcs_noaerospike_20181010024430971,stack-raptor,3.6.2;***;unset
  B = riskunifiedcomputeserv-3.112.20181009000545738_rtcs_lnp_20181009000545738,stack-raptor,3.6.2;***;unset
  C = riskunifiedcomputeserv-3.112.20181010185011884_rtcs_lnp_20181010185011884,stack-raptor,3.6.2;***;unset

Build  Colo   Count    Failure Count  Failure %  Min (ms)  Max (ms)  Avg (ms)  Median (ms)  StdDev    P95 (ms)  P99 (ms)
A      dcg13  901703   9              0.000998   0         2253      237.4642  214.9349     86.76496  411.2589  557.873
B      dcg13  15642    0              0          90        930       264.4271  256.5768     76.69193  388.2483  478.5151
C      dcg13  15664    0              0          91        936       259.5916  252.6083     72.02329  384.2889  463.918

Difference                              Max       Avg       P95       P99
2x Traffic, +UseC2 (A - B) diff msec:   1323      -26.9629  23.01053  79.35785
2x Traffic, +UseC2 (A - B) diff %:      58.7217   -11.3545  5.595145  14.22507
2x Traffic (A - C) diff msec:           1317      -22.1274  26.96998  93.95496
2x Traffic (A - C) diff %:              58.45539  -9.31822  6.557909  16.84164



RTCS TPM, peak hours, Zing 3x traffic, 24GB & 60GB heap



RTCS CPU, peak hours, Zing 3x traffic, 24GB & 60GB heap



RTCS latency, peak hours, Zing 3x traffic, 24GB & 60GB heap

• RTCS VM: 16 vCPU, 100GB RAM


• Test: 24GB heap (same as Oracle) and 60GB heap, no tuning
• Good CPU profile with 60GB heap
• With the 60GB heap, high percentiles are better than Oracle at 1x traffic: ~150 ms (~24%) faster at P99

Time window: 10/25/18 13:00 - 14:00
Note: Zing boxes run 3x traffic

Build labels:
  A = riskunifiedcomputeserv-3.117.1_20181023040926391,stack-raptor,3.6.2;***;unset
  B = riskunifiedcomputeserv-3.112.20181010185011884_rtcs_lnp_20181010185011884,stack-raptor,3.6.2;***;unset
  C = riskunifiedcomputeserv-3.117.1_lnp_20181023185911957,stack-raptor,3.6.2;***;unset

Build  Colo   Count    Failure Count  Failure %  Min (ms)  Max (ms)  Avg (ms)  Median (ms)  StdDev    P95 (ms)  P99 (ms)
A      dcg13  710023   37             0.005211   0         2364      250.9773  224.0962     100.8186  457.5017  623.0563
B      All    13362    0              0          5         961       366.8135  359.1465     102.1664  544.196   644.082
C      All    13359    0              0          76        2613      273.6039  269.6309     75.51762  397.11    471.6934

Difference                            Max       Avg       P95       P99
3x Traffic, 24GB (A - B) diff msec:   1403      -115.836  -86.6943  -21.0257
3x Traffic, 24GB (A - B) diff %:      59.34856  -46.1541  -18.9495  -3.3746
3x Traffic, 60GB (A - C) diff msec:   -249      -22.6266  60.3917   151.363
3x Traffic, 60GB (A - C) diff %:      -10.533   -9.01538  13.20032  24.29362



RGCS latency, peak hours

• RGCS VM: 16 vCPU, 100GB RAM


• RGCS models are more lightweight vs RTCS
• Test: 24GB heap (same as Oracle), no tuning
• Benefit grows with increasing percentile: ~58 ms (~27%) faster at P99

Time window: 10/11/18 11:00 - 13:00

Build labels:
  A = riskunifiedcomputeserv-3.113.4_rtcs_noaerospike_20181010024430971,stack-raptor,3.6.2;***;unset
  B = riskunifiedcomputeserv-3.112.20181010185011884_rtcs_lnp_20181010185011884,stack-raptor,3.6.2;***;unset

Build  Colo   Count    Failure Count  Failure %  Min (ms)  Max (ms)  Avg (ms)  Median (ms)  StdDev    P95 (ms)  P99 (ms)
A      dcg13  1039031  1              9.62E-05   2         1334      66.78721  58.89613     40.54734  131.25    216.0075
B      dcg13  16255    0              0          3         262       61.88502  55.72537     30.28953  116.1765  158.3844

Difference (A - B)  Max       Avg       P95       P99
diff msec:          1072      4.902189  15.07353  57.6231
diff %:             80.35982  7.340012  11.48459  26.67643



RUDAES latency

• RUDAES VM: 4 vCPU, 16GB RAM


• Relatively light data load (currently only Tier1 Login checkpoint)
• Test: 4GB heap (same as Oracle), 8GB heap, no tuning, limited duration test
• Latency marginally better. Higher CPU consumption observed (~20% vs ~10%) with both the 4GB and 8GB heaps
• Further analysis required to understand CPU consumption

Time window: 10/5/18 14:00 - 15:30

Build labels:
  A = riskudauthevalserv-1.0.0_20180913002301757,stack-raptor,3.7.0;***;unset
  B = riskudauthevalserv-1.0.0_20181005113945927,stack-raptor,3.7.0;***;unset

Build  Colo   Count    Failure Count  Failure %  Min (ms)  Max (ms)  Avg (ms)  Median (ms)  StdDev    P95 (ms)  P99 (ms)
A      dcg13  434893   3              0.00069    0         1048      41.8275   29.50323     52.60037  151.9565  269.139
B      dcg13  17811    0              0          4         656       36.31666  29.76976     49.09285  144.4167  252.58

Difference (A - B)  Max       Avg       P95       P99
diff msec:          392       5.510837  7.539785  16.55895
diff %:             37.40458  13.17515  4.961806  6.152566



RUDAES CPU



Potential Benefits
Benefits

• Higher compute utilization – infrastructure efficiency


• ~$200 / VM / month
• RTCS alone runs on ~1000 nodes; what if we can cut in half?

• Eliminate GC tuning – developer productivity


• ~x engineers? ~$100k / engr / year?
• Work on developing features, not GC tuning

• Lower latency – fewer timeouts, higher Risk ATB


• Incremental fraud benefit (~$250 / bps ATB / day)
• What if we can lift by 10 bps (99.8% -> 99.9%)?

• Lower latency – faster response


• Incremental conversion benefit ($xxM additional revenue?)
• User experience (priceless?)
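The two "what if" questions above can be turned into rough annual figures. This is a back-of-the-envelope sketch only: the per-VM cost, node count, and per-bps value are the approximate numbers quoted on the slide, while the 365-day year and the assumption that the per-bps value scales linearly are added here for illustration.

```python
# Approximate figures quoted on the slide
VM_COST_PER_MONTH_USD = 200          # "~$200 / VM / month"
RTCS_NODES = 1000                    # "~1000 nodes"
NODE_REDUCTION = 0.5                 # what-if: "cut in half"

ATB_VALUE_PER_BPS_PER_DAY_USD = 250  # "~$250 / bps ATB / day"
ATB_LIFT_BPS = 10                    # what-if: 99.8% -> 99.9%

# Assumptions added for this sketch (not from the slide)
MONTHS_PER_YEAR = 12
DAYS_PER_YEAR = 365

# Infrastructure saving if half of the RTCS nodes could be retired
infra_saving_per_year = (
    VM_COST_PER_MONTH_USD * RTCS_NODES * NODE_REDUCTION * MONTHS_PER_YEAR
)

# Fraud benefit if ATB rises by 10 bps and the per-bps value scales linearly
atb_benefit_per_year = ATB_VALUE_PER_BPS_PER_DAY_USD * ATB_LIFT_BPS * DAYS_PER_YEAR

print(f"Infra saving:  ${infra_saving_per_year:,.0f} / year")
print(f"ATB benefit:   ${atb_benefit_per_year:,.0f} / year")
```

Neither figure accounts for Zing licensing cost, which is the subject of the commercial negotiation mentioned in the goals.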



Benefits – conversion benefit estimation model

Baseline

Attempts (in $B)  Decision speed  Conv rate  Population  TPV (in $B)
$500              fast            86%        37.0%       $159.10
                  ok              83%        50.0%       $207.50
                  slow            80%        10.0%       $40.00
                  (remainder)                3.0%

Revenue (3% take): $12.20B (baseline)

With 0.1% limited lift

Attempts (in $B)  Decision speed  Conv rate  Population  TPV (in $B)
$500              fast            86%        37.0%       $159.10
                  ok              83%        50.1%       $207.92
                  slow            80%        10.0%       $40.00
                  (remainder)                2.9%

Revenue (3% take):   $12.21B
Baseline revenue:   -$12.20B
Revenue lift:        $12.45M
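The estimation model above is simple enough to recompute. The sketch below assumes TPV per segment is attempts × population share × conversion rate and that revenue is 3% of total TPV, which matches the figures shown; the function and variable names are illustrative.

```python
ATTEMPTS_B = 500.0  # payment attempts, in $B
TAKE_RATE = 0.03    # 3% take

# Conversion rate per decision-speed segment, as given in the model
CONV_RATE = {"fast": 0.86, "ok": 0.83, "slow": 0.80}


def revenue_b(population):
    """Revenue in $B for a given population share per decision-speed segment."""
    tpv_b = sum(
        ATTEMPTS_B * share * CONV_RATE[speed]
        for speed, share in population.items()
    )
    return tpv_b * TAKE_RATE


baseline = revenue_b({"fast": 0.370, "ok": 0.500, "slow": 0.100})
# Scenario: 0.1% of the population moves into the "ok" segment
lifted = revenue_b({"fast": 0.370, "ok": 0.501, "slow": 0.100})

lift_m = (lifted - baseline) * 1000  # convert $B to $M

print(f"baseline ${baseline:.2f}B, lifted ${lifted:.2f}B, lift ${lift_m:.2f}M")
```

The lift is computed from unrounded revenue, which is why it comes to about $12.45M rather than the $10M suggested by subtracting the rounded $12.20B from $12.21B.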



Next Steps
Next Steps – Technical Track

• Technical Track
• Roll out at full AZ scale (e.g., DCG12), measure end-to-end impact (customer-perceived latency)

• Test big heaps


• E.g., RTCS 64 GB heap, leverage idle RAM (!)

• Test ReadyNow JIT feature


• What is behavior (speedup) on restart/redeploy?

• Test on bare metal


• E.g., 64 cores, 512 GB RAM, no VM, 1 JVM (on Docker)
• Can we achieve higher efficiency?
• Potential 10-bagger? From ~1000 VMs to ~100 machines (!)

• Test in performance engineering lab


• Controlled environment, micro-benchmarking
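As a sanity check on the bare-metal "10-bagger" idea above, the aggregate capacity of the two fleets can be compared. This is a rough sketch using only the sizes quoted in the slides (16 vCPU / 100 GB VMs, 64-core / 512 GB machines); treating one vCPU as comparable to one physical core is an assumption made here, and real capacity depends on hypervisor overhead and hyper-threading.

```python
# Fleet sizes and per-node specs quoted in the slides
VMS, VCPU_PER_VM, RAM_PER_VM_GB = 1000, 16, 100
MACHINES, CORES_PER_MACHINE, RAM_PER_MACHINE_GB = 100, 64, 512

# Aggregate compute and memory of each fleet
total_vcpu = VMS * VCPU_PER_VM
total_cores = MACHINES * CORES_PER_MACHINE
total_vm_ram_gb = VMS * RAM_PER_VM_GB
total_bm_ram_gb = MACHINES * RAM_PER_MACHINE_GB

# Assumption: one vCPU is treated as comparable to one physical core.
# Throughput gain each core would need to deliver to serve the same load.
required_per_core_gain = total_vcpu / total_cores

# Fraction of today's total RAM that the bare-metal fleet would have
ram_ratio = total_bm_ram_gb / total_vm_ram_gb

print(f"vCPUs today: {total_vcpu}, bare-metal cores: {total_cores}")
print(f"Required per-core throughput gain: {required_per_core_gain:.1f}x")
print(f"Bare-metal RAM as a share of today's RAM: {ram_ratio:.1%}")
```

Under these assumptions, a 10x reduction in machine count would require each core to carry about 2.5x the load it does today, with roughly half the total RAM available.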



Next Steps – Business Track

• Business Track
• Weigh costs vs benefits
• Negotiate licensing terms



Credits, References
Credits – Thank You!

eServ/Risk Platform
• RUCS: David Zhang, Devin Wu, Levi Li, Ye Fu, Tony Wu, Simon Zhang
• RUPS: Srini Manoharan
• RUDAES: Qinghai Fu, Shibo Wu
• Vendor mgmt: Ajay Phulwadhwa

DAMA/Risk Analytics
• Ming Ouyang

CPI
• Raptor: Srividya Venkataraman, John Nutting, Adnan Prcic, Rama Kolli
• Cloud: Anderson Dang, Shyam Patel
• SRE: Jenny Chen, Joe Cornett



References

PayPal Internal
• https://engineering.paypalcorp.com/confluence/display/ADD/Zing+JVM+evaluation
• https://engineering.paypalcorp.com/confluence/display/RaptorServices/Zing+JVM+for+raptor
• https://github.paypal.com/FrameworksRaptor-R/zingJVM
• https://github.paypal.com/PaaS/hostutils/wiki/paas-base-ubuntu-20180827
• Slack: #zingjvmtesting

Azul Documentation
• https://www.azul.com/products/zing/whatisit/
• https://www.azul.com/products/zing/jvm-tuning/



Thank You! … Questions?
