
Prometheus Patch:

#674 and You


@stuhood Cassandra SF July 11th, 2011

Monday, July 11, 2011

Problems?
What Problems?


Problem Overview

Compression
Efficient Compaction
True Random Access
Range/Slice Deletes
Corruption Recovery

Compression

#47

Get free* memory!

Between 2x and 10x

Immutable files, low-hanging fruit, etc.

*for a small CPU fee

Efficient Compaction

#16

Row width stored at the head of a row on disk

Must be calculated at write time
Doesn't always fit in memory (wide rows)

Improved with two-pass compaction in 0.7

But wide rows must still seek:

once for each input or output

True Random Access


#2319

Wide rows are second-class citizens:

Does a column slice exist?

Let's seek twice to find out

Fetch me this column!

Total of three seeks

Should treat key+name as compound

Range/Slice Deletes

#293 / #494

Should be able to delete:

A range of keys
A slice of columns

Prerequisite for more natural supercolumn representation:

Compound column names
Delete children by deleting a slice

Corruption Recovery

#808 / #1717

Failing drives cause silent errors

Some RAID modes won't check mismatches

Should be able to react:

Skip?
Fail?
Repair from neighbor?

Solutions?
674 + 2319: A Holistic Approach


Solution Overview

Compression
Efficient Compaction
True Random Access
Range/Slice Deletes
Corruption Recovery

Compression

Type specific (#2398)

Applied to keys, names, values, timestamps

AbstractTypes can (optionally) override default

default: LZF

LongType: varint and delta encoding

Series compress to ~1 byte/entry

CounterColumnType

Series compress to ~2*rf bytes/entry
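
To make the "varint and delta encoding" bullet concrete, here is a minimal sketch of how a series of longs can shrink to about one byte per entry. It is not the patch's actual LongType encoder: the class name `DeltaVarint`, the zigzag step, and the wire layout are all assumptions chosen for illustration.

```java
import java.io.ByteArrayOutputStream;

// Illustrative only: DeltaVarint and its layout are assumptions, not the #2398 code.
final class DeltaVarint {
    // Writes each value as the zigzag-mapped difference from the previous value,
    // using a base-128 varint (7 payload bits per byte, high bit = "more bytes follow").
    static byte[] encode(long[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long prev = 0;
        for (long v : values) {
            long delta = v - prev;
            writeVarint(out, (delta << 1) ^ (delta >> 63)); // zigzag keeps small negatives small
            prev = v;
        }
        return out.toByteArray();
    }

    // Reverses encode(); the caller supplies the number of entries.
    static long[] decode(byte[] bytes, int count) {
        long[] values = new long[count];
        int pos = 0;
        long prev = 0;
        for (int i = 0; i < count; i++) {
            long raw = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                raw |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += (raw >>> 1) ^ -(raw & 1); // undo zigzag, then undo delta
            values[i] = prev;
        }
        return values;
    }

    private static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }
}
```

In this sketch only the first value pays for its full magnitude; every later entry whose delta fits in 6 bits costs a single byte, which is where a figure like ~1 byte/entry for sorted timestamps can come from.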

Efficient Compaction

Blocks with arbitrary number of rows/columns

No special casing for wide rows

Drops ~1600 lines of multi-pass code

Configurable size (64k default)

Never larger than memory (unless a column is)

One-pass compaction

Buffer a block, compress, checksum, flush
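
The one-pass idea above can be sketched as a buffer that accepts whole entries and flushes once a target size is reached. This is a hedged illustration, not the patch's writer: `BlockBuffer` and its fields are hypothetical names, and the compress and checksum steps are only marked by a comment.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: BlockBuffer is a hypothetical name, not a class from #674.
final class BlockBuffer {
    private final int targetSize;
    private final ByteArrayOutputStream current = new ByteArrayOutputStream();
    final List<byte[]> flushed = new ArrayList<>(); // stands in for the data file

    BlockBuffer(int targetSize) {
        this.targetSize = targetSize;
    }

    // Entries are never split, so a block overshoots the target by at most one entry.
    void append(byte[] entry) {
        current.write(entry, 0, entry.length);
        if (current.size() >= targetSize) {
            flush();
        }
    }

    // A real writer would compress and checksum the buffered bytes here before writing them out.
    void flush() {
        if (current.size() == 0) {
            return;
        }
        flushed.add(current.toByteArray());
        current.reset();
    }
}
```

Because memory use is bounded by the block size plus one entry, the same loop serves narrow and wide rows, which is the point of dropping the multi-pass code.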


True Random Access

SSTable index file as simplified data file (#2319)

But: sparse names, no values

Type compressed (sorted order == tiny)

Allows random access to wide rows

One seek to eliminate an sstable

Great for timeseries

Two seeks to read columns (as with narrow rows)

Range/Slice Deletes

Range/Slice metadata

This range/slice was deleted at time=X

Timestamps, caps encoded with column data

Type compressed together

Row, supercolumn deletes already implemented via range/slice metadata


(not currently supported in API)
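
A rough sketch of what "this range/slice was deleted at time=X" could mean at read time. The class `SliceTombstone`, the inclusive bounds, and the rule that a tie goes to the delete are assumptions made for illustration; the slides do not spell them out.

```java
// Illustrative only: names, inclusive bounds, and tie-breaking are assumptions.
final class SliceTombstone {
    final String begin;   // first column name covered by the delete
    final String end;     // last column name covered by the delete
    final long deletedAt; // "deleted at time=X"

    SliceTombstone(String begin, String end, long deletedAt) {
        this.begin = begin;
        this.end = end;
        this.deletedAt = deletedAt;
    }

    // A column is hidden when its name falls inside the slice and it was
    // written no later than the delete.
    boolean shadows(String name, long writeTime) {
        boolean inSlice = begin.compareTo(name) <= 0 && name.compareTo(end) <= 0;
        return inSlice && writeTime <= deletedAt;
    }
}
```

With compound column names, deleting a supercolumn's children becomes a single slice over the shared prefix, for example from `"super1:"` to `"super1:\uffff"` under a hypothetical `parent:child` naming scheme.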


Corruption Recovery

Length at head of block with checksum at tail

Was length corrupt?

MAGIC header

Corrupt length? Failed checksum?

Scan for MAGIC (not implemented!)
Request repair from replica (not implemented!)
Fail noisily (totally implemented!)

#674 Disk Layout


Block Example

key1
  name1  value1
  name2  value2

key2
  name1  value1

(... ~64K of rows/columns)


The Block

One chunk per tree level

keys, (supernames), names, values

Read a column:

Decode all chunks

Miss a key / column:

Only decode relevant chunks


The Chunk

Entries for a particular level of a tree

Entry enum byte per entry (next slide)

Values at a level are all the same AbstractType

Type compressed

Local and client timestamps

LongType compressed
Bitset for nulls

Entry Enum

Entry enum byte per entry in a chunk

NAME, COUNTER, RANGE_BEGIN/END, STANDARD, EXPIRING, etc.

Chunk decoded as state machine:

1. Read next enum
2. Interesting?

true: lazily decode value, timestamps
false: skip entry by bumping positions
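
The two-step state machine can be sketched as a loop over the enum bytes that either materializes a value or just advances a position. Everything concrete here is an assumption for illustration: the class `ChunkScan`, the one-byte length prefix on each value, and the choice of which entry kinds carry a value.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Illustrative only: layout and names are assumptions, not the #674 format.
final class ChunkScan {
    // Subset of the entry kinds named on the slide; hasValue is an assumed property.
    enum Entry {
        NAME(true), STANDARD(true), DELETED(false), RANGE_BEGIN_NULL(false), RANGE_END_NULL(false);

        final boolean hasValue;

        Entry(boolean hasValue) {
            this.hasValue = hasValue;
        }
    }

    // Walks the entries once; values are only turned into Strings for interesting entries.
    static List<String> decode(Entry[] enums, ByteBuffer values, IntPredicate interesting) {
        List<String> decoded = new ArrayList<>();
        for (int i = 0; i < enums.length; i++) {
            Entry entry = enums[i];                 // 1. read next enum
            if (!entry.hasValue) {
                continue;                           // markers and tombstones occupy no value slot
            }
            int length = values.get() & 0xFF;
            if (interesting.test(i)) {              // 2. interesting? decode
                byte[] raw = new byte[length];
                values.get(raw);
                decoded.add(new String(raw, StandardCharsets.UTF_8));
            } else {                                // otherwise skip by bumping the position
                values.position(values.position() + length);
            }
        }
        return decoded;
    }

    // Test helper: packs values (each under 256 bytes) with a one-byte length prefix.
    static ByteBuffer pack(String... values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String value : values) {
            byte[] raw = value.getBytes(StandardCharsets.UTF_8);
            out.write(raw.length);
            out.write(raw, 0, raw.length);
        }
        return ByteBuffer.wrap(out.toByteArray());
    }
}
```

Skipped entries cost one enum read and one position update, so a miss stays cheap even in a large chunk.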


Block Example: Key Chunk

enums: RANGE_BEGIN_NULL, PARENT, PARENT, RANGE_END_NULL
values: key1, key2
clientts: <null>, <null>
localts: <null>, <null>

Block Example: Name Chunk

enums: RANGE_BEGIN_NULL, NAME, NAME, RANGE_END_NULL, RANGE_BEGIN_NULL, NAME, RANGE_END_NULL
values: name1, name2, name1
clientts: 123000, 124000
localts: 123, 124


Block Example: Value Chunk

enums: STANDARD, DELETED, STANDARD
values: value1, value1
clientts: 233000, 234000, 235000
localts: 233, 234, 235

The Chunk (encoded)

Encoded disk layout:

byte[] MAGIC;        // magic resync value
int length;          // length of encoded content
ByteBuffer content;  // encoded content (decoded on next slide)
int checksum;        // MurmurHash of content
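
The encoded layout above, together with the "fail noisily" behaviour from the corruption slide, can be sketched as follows. Two things are assumptions: the `MAGIC` bytes are invented (the slides give no value), and CRC32 from the JDK stands in for MurmurHash, which the standard library does not provide.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Illustrative only: MAGIC is a made-up marker and CRC32 replaces MurmurHash.
final class ChunkFrame {
    static final byte[] MAGIC = {'C', 'H', 'N', 'K'};

    // Layout: MAGIC, int length, content, int checksum of content.
    static byte[] encode(byte[] content) {
        ByteBuffer buf = ByteBuffer.allocate(MAGIC.length + 4 + content.length + 4);
        buf.put(MAGIC).putInt(content.length).put(content).putInt(checksum(content));
        return buf.array();
    }

    // Returns the content, or fails noisily when any part of the frame is inconsistent.
    static byte[] decode(byte[] frame) {
        ByteBuffer buf = ByteBuffer.wrap(frame);
        for (byte expected : MAGIC) {
            if (!buf.hasRemaining() || buf.get() != expected) {
                throw new IllegalStateException("bad MAGIC header");
            }
        }
        if (buf.remaining() < 8) {
            throw new IllegalStateException("truncated frame");
        }
        int length = buf.getInt();
        if (length < 0 || length > buf.remaining() - 4) {
            throw new IllegalStateException("corrupt length");
        }
        byte[] content = new byte[length];
        buf.get(content);
        if (buf.getInt() != checksum(content)) {
            throw new IllegalStateException("failed checksum");
        }
        return content;
    }

    static int checksum(byte[] content) {
        CRC32 crc = new CRC32();
        crc.update(content);
        return (int) crc.getValue();
    }
}
```

Putting the length at the head and the checksum at the tail lets a reader tell a damaged length apart from damaged content, which is what would make a later scan for the next MAGIC worthwhile.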


The Chunk (decoded)

Decoded content:

ByteBuffer enums;      // enums for entries
ByteBuffer[] values;   // values for level
long[] clientTS;       // client timestamps
Bitset clientTSNulls;  // bit per client ts
int[] localTS;         // local timestamps
Bitset localTSNulls;   // bit per local ts
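
One way the timestamp arrays and their null bitsets might fit together is sketched below. It assumes that a set bit means "this entry has no timestamp" and that null entries are left out of the array; the slides do not confirm either point, and `NullableTimestamps` is a hypothetical name.

```java
import java.util.BitSet;

// Illustrative only: the meaning of the bits and the dense array are assumptions.
final class NullableTimestamps {
    private final long[] present; // timestamps for the non-null entries, in entry order
    private final BitSet nulls;   // bit i set = entry i has no timestamp

    NullableTimestamps(long[] present, BitSet nulls) {
        this.present = present;
        this.nulls = nulls;
    }

    // Returns the timestamp of an entry, or null when its bit is set.
    Long get(int entry) {
        if (nulls.get(entry)) {
            return null;
        }
        int nullsBefore = nulls.get(0, entry).cardinality();
        return present[entry - nullsBefore];
    }
}
```

Under these assumptions an entry without a timestamp costs one bit rather than a full long.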


Point Read Path


0. Use promoted index file to find block for column
1. Seek directly to block in data file
2. Decode key, name, value chunks

optionally check chunk checksums

(at most 64k (or configurable) decoded)
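
Step 0 of the read path can be sketched as a floor lookup in a sorted map from each block's first key+name pair to its file offset. This is a simplification under stated assumptions: `SparseIndex`, the in-memory `TreeMap`, and the `\u0000` separator (which assumes keys never contain that character) are illustrative, not the index file format of #2319.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: an in-memory stand-in for the promoted index file.
final class SparseIndex {
    private final TreeMap<String, Long> firstEntryToOffset = new TreeMap<>();

    // Records the first key+name pair stored in a block and where the block starts.
    void addBlock(String firstKey, String firstName, long offset) {
        firstEntryToOffset.put(compound(firstKey, firstName), offset);
    }

    // Returns the offset of the only block that can contain (key, name),
    // or -1 when the pair sorts before every indexed block.
    long blockFor(String key, String name) {
        Map.Entry<String, Long> entry = firstEntryToOffset.floorEntry(compound(key, name));
        return entry == null ? -1L : entry.getValue();
    }

    // Treats key+name as one compound, so a wide row can span many index entries.
    private static String compound(String key, String name) {
        return key + '\u0000' + name;
    }
}
```

Because the lookup uses the compound of key and name, a column deep inside a wide row resolves to a single block, the same as a column in a narrow row.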


Current Status


Wide row support finished June 24th

Narrow row support finished July 9th

Ready for wider testing!

Remaining issues for #674 + #2319

Reversed slices are stubbed
Scrub is stubbed
Two to three streaming stubs

Future Optimizations


Scan Read Path


Find with value/name X (not implemented)

1. Scan to first block
2. Decode only value/name chunk
3. Hit?

true: decode other chunks
false: ignore other chunks

Compaction Path

Compact without decoding (not implemented)

0. Synchronize blocks across files
1. Decode key chunk
2. Any keys / ranges intersecting?

true: decode other chunks, resolve
false: append encoded chunks
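
The step 2 decision can be sketched as an interval-overlap test on the key ranges of the blocks being compacted. Since the slide marks this path as not implemented, the code below is purely a thought experiment; `KeyRange`, `mustDecode`, and the inclusive string bounds are all hypothetical.

```java
// Illustrative only: a thought experiment for an unimplemented optimization.
final class KeyRange {
    final String first; // smallest key in the block
    final String last;  // largest key in the block

    KeyRange(String first, String last) {
        this.first = first;
        this.last = last;
    }

    // Two inclusive ranges overlap when each starts no later than the other ends.
    boolean intersects(KeyRange other) {
        return first.compareTo(other.last) <= 0 && other.first.compareTo(last) <= 0;
    }

    // true: the block must be decoded and resolved against the others;
    // false: its encoded chunks can be appended to the output untouched.
    static boolean mustDecode(KeyRange block, KeyRange... others) {
        for (KeyRange other : others) {
            if (block.intersects(other)) {
                return true;
            }
        }
        return false;
    }
}
```

Only the key chunk needs decoding to build such a range, so blocks that do not overlap would never have their name or value chunks touched.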


Chunk per Column


Narrow row type optimization (not implemented)

Split columns with defined validators into separate chunks

Ex. chunks: age:int, location:geo, name:utf8

Decode only chunks for requested columns

That's it!

Questions?

We're hiring @jointheflock