Avro
[Diagram: one side serializes and the other deserializes, in both directions]
• SOAP
• CORBA
• DCOM, COM+
• JSON, Plain Text, XML
Should we pick one of these? (No.)
• SOAP
    • XML, XML and more XML. Do we really need to parse so much XML?
• CORBA
    • Amazing idea, horrible execution
    • Overdesigned and heavyweight
• DCOM, COM+
    • Embraced mainly in Windows client software
• HTTP/JSON/XML/Whatever
    • Okay, proven – hurray!
    • But it lacks a protocol description.
    • You have to maintain both client and server code.
    • You still have to write your own wrapper for the protocol.
    • XML has high parsing overhead: it is (relatively) expensive to process and large due to repeated tags.
Decision Time?
JSON vs. Binary
• JSON: {"deposit_money": "12345678"}
    '0x6d', '0x6f', '0x6e', '0x65', '0x79', '0x31', '0x32', '0x33', '0x34', '0x35', '0x36', '0x37', '0x38'
• Binary:
    '0x01', '0xBC614E'
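The size difference on this slide can be checked with a small sketch. The binary layout used here (one operation byte followed by a 4-byte big-endian amount) is an assumption made for illustration; the slide only shows the two values '0x01' and '0xBC614E'. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class JsonVsBinarySize {
    // The JSON text from the slide, encoded as UTF-8 bytes.
    static byte[] asJson() {
        return "{\"deposit_money\": \"12345678\"}".getBytes(StandardCharsets.UTF_8);
    }

    // Assumed binary layout: 1 byte operation code (0x01 = deposit)
    // followed by the amount as a 4-byte big-endian integer (0x00BC614E).
    static byte[] asBinary() {
        return ByteBuffer.allocate(5).put((byte) 0x01).putInt(12345678).array();
    }

    public static void main(String[] args) {
        System.out.println("JSON bytes:   " + asJson().length);
        System.out.println("Binary bytes: " + asBinary().length);
    }
}
```

With this layout the same information takes 5 bytes instead of 29, and the amount needs no text-to-integer conversion on the receiving side.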
JSON vs. Binary
• JSON: needs a pushdown automaton (PDA) parser (LL(1), LR(1)) with 1 character of lookahead, then a final translation from characters to native types (int, float, etc.).
• Binary: no parser needed. The binary representation IS [as close as possible to] the machine representation.
JSON vs. Binary
• JSON: brainless to learn; popular.
• Binary: you need to manually write code to define message packets (a total pain and error prone!).
Or... this is where data interchange protocols come into play…
Serialization Frameworks
XML, JSON,
Protocol Buffers
• Designed around 2001 because nothing else available at the time was good enough
• Every time you hit a Google page, you are hitting several services and several pieces of PB code
• Official support for four languages: C++, Java, Python, and JavaScript
• Does have a lot of third-party support for other languages (of highly variable quality)
• BSD License
Apache Thrift
• IDL syntax is slightly cleaner than PB. If you know one, then you know the other
• Supports: C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa,
JavaScript, Node.js, Smalltalk, OCaml and Delphi and other languages
• DO NOT EDIT!
Thrift Principle of Operation
Interface Definition Language (IDL)
• The newer frameworks use their own definition languages, which are not based on XML.
• The "= 1", "= 2" (Protocol Buffers) or "1:", "2:" (Thrift) markers on each element identify the unique "tag" that the field uses in the binary encoding.
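To make the role of the tag concrete, here is a minimal sketch of how an integer field is laid out in the Protocol Buffers wire format as publicly documented: a key computed as (field number << 3) | wire type, followed by the value as a base-128 varint. Thrift's protocols use a different layout, so this illustrates the idea only. The class and method names are hypothetical; real projects would use the generated code instead.

```java
import java.io.ByteArrayOutputStream;

public class FieldTagSketch {
    // Base-128 varint: 7 payload bits per byte, high bit set on all but the last byte.
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // Encodes one integer field as <key><value>, where key = (fieldNumber << 3) | wireType.
    static byte[] encodeIntField(int fieldNumber, long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarint(out, ((long) fieldNumber << 3) | 0); // wire type 0 = varint
        writeVarint(out, value);
        return out.toByteArray();
    }

    // Formats bytes as space-separated lowercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x ", b));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Field number 1 carrying the value 150.
        System.out.println(hex(encodeIntField(1, 150)));
    }
}
```

The field name never appears on the wire; only the numeric tag does, which is why tags must stay stable once a message format is in use.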
...
import bank_example.BankDepositMsg;
...
BankDepositMsg my_transaction = new BankDepositMsg();
my_transaction.setUser_id(123);
my_transaction.setAmount(1000.00);
my_transaction.setDatestamp(new Timestamp(date.getTime()));
...
In Java (and other compiled languages) you get getters and setters, so if a field name or type is changed by mistake, the compiler will tell you.
The Comparison…
                 Thrift                   Protocol Buffers
Composite type   Struct {}                Message {}
Base types       bool                     bool
                 byte                     32/64-bit integers
                 16/32/64-bit integers    float
                 double                   double
                 string                   string
                                          byte sequence
Containers       list<t1>, set<t1>,       No
                 map<t1,t2>

Thrift containers:
• list<t1>: An ordered list of elements of type t1. May contain duplicates.
• set<t1>: An unordered set of unique elements of type t1.
• map<t1,t2>: A map of strictly unique keys of type t1 to values of type t2.
• Size Comparison
• Runtime Performance
Size Comparison
Each write includes one Course object with 5 Person objects, and one Phone
object.
• TBinaryProtocol: not optimized for space efficiency. Faster to process than the text protocol but more difficult to debug.
• TCompactProtocol: a more compact binary format; typically more efficient to process as well.
• Test Scenario
• This scenario is executed 10,000 times. The tests were run on the
following systems:
• The combination of a field's identifier and its type specifier is used to uniquely identify the field.
[Diagram: BankDepositMsg exchanged between different versions of the message definition; the branch_id field reads as None when it is absent and as 1333 when it is present]
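The versioning behaviour can be sketched as follows. This is a simplified, hypothetical wire format (a varint tag followed by a varint value), not the actual Thrift or Protocol Buffers encoding, and the tag numbers, the amount in cents, and all names are assumptions made for the example. It shows the two cases: an old reader ignores a tag it does not know, and a new reader sees an absent field as missing.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SchemaEvolutionSketch {
    // Hypothetical tags for BankDepositMsg; BRANCH_ID was added in a later version.
    static final int USER_ID = 1, AMOUNT_CENTS = 2, BRANCH_ID = 3;
    static final Set<Integer> OLD_READER = new HashSet<>(Arrays.asList(USER_ID, AMOUNT_CENTS));
    static final Set<Integer> NEW_READER = new HashSet<>(Arrays.asList(USER_ID, AMOUNT_CENTS, BRANCH_ID));

    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    static long readVarint(byte[] data, int[] pos) {
        long result = 0;
        int shift = 0;
        while (true) {
            byte b = data[pos[0]++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
            shift += 7;
        }
    }

    // Writes alternating tag, value, tag, value, ... as varints.
    static byte[] write(long... tagValuePairs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (long v : tagValuePairs) writeVarint(out, v);
        return out.toByteArray();
    }

    // Reads <tag><value> pairs and keeps only the tags this reader knows about.
    static Map<Integer, Long> read(byte[] data, Set<Integer> knownTags) {
        Map<Integer, Long> fields = new HashMap<>();
        int[] pos = {0};
        while (pos[0] < data.length) {
            int tag = (int) readVarint(data, pos);
            long value = readVarint(data, pos);
            if (knownTags.contains(tag)) fields.put(tag, value); // unknown tags are skipped
        }
        return fields;
    }

    static byte[] oldMessage() { return write(USER_ID, 123, AMOUNT_CENTS, 100000); }
    static byte[] newMessage() { return write(USER_ID, 123, AMOUNT_CENTS, 100000, BRANCH_ID, 1333); }

    public static void main(String[] args) {
        System.out.println("old reader, new data: " + read(newMessage(), OLD_READER));
        System.out.println("new reader, old data: branch_id = " + read(oldMessage(), NEW_READER).get(BRANCH_ID));
        System.out.println("new reader, new data: branch_id = " + read(newMessage(), NEW_READER).get(BRANCH_ID));
    }
}
```

Skipping works here because every value is self-delimiting; real formats carry a type or wire-type marker with each field for the same reason.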
Projects Using Thrift
• Facebook
• Cassandra project
• Hadoop supports access to its HDFS API through Thrift bindings
• HBase leverages Thrift for a cross-language API
• Hypertable leverages Thrift for a cross-language API since v0.9.1.0a
• LastFM
• DoAT
• ThriftDB
• Scribe
• Evernote uses Thrift for its public API.
• Junkdepot
Projects Using Protobuf
• Google
• Netty (protobuf-rpc)
Pros
• Thrift: includes an RPC implementation for services.
• Protocol Buffers: better documentation; API a bit cleaner than Thrift.
Cons
• Thrift: good examples are hard to find; missing/incomplete documentation.
• Protocol Buffers: .proto can define services, but no RPC implementation is defined (although stubs are generated for you).
I’d choose Protocol Buffers over Thrift if:
Avro
• When Avro data is read, the schema used when writing it is always present.
• Avro data is always serialized with its schema. When Avro data is stored in a file,
its schema is stored with it, so that files may be processed later by any program.
• Avro schemas play the same role as Protocol Buffers .proto files, but no code has to be generated from them.
• Official support for six languages: Java, C, C++, C#, Python, Ruby
• An RPC framework.
// Avro schema (JSON):
{
  "type": "record",
  "name": "BankDepositMsg",
  "fields": [
    {"name": "user_id", "type": "int"},
    {"name": "amount", "type": "double", "default": 0.00},
    {"name": "datestamp", "type": "long"}
  ]
}
• JSON array
• Primitive types: null, boolean, int, long, float, double, bytes, string
• {"type": "string"}
• Dynamic typing: Avro does not require that code be generated. Data is
always accompanied by a schema that permits full processing of that
data without code generation, static datatypes, etc.
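Because the schema always travels with the data, Avro's binary encoding does not need per-field tags: a record is simply its field values written in schema order. The sketch below hand-rolls that encoding for BankDepositMsg, following the Avro specification as I understand it (int and long as zig-zag varints, double as 8 little-endian bytes). It does not use the Avro library, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class AvroEncodingSketch {
    // Zig-zag maps small negative and positive numbers to small unsigned ones:
    // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    static long zigZag(long n) {
        return (n << 1) ^ (n >> 63);
    }

    // int and long values: zig-zag, then base-128 varint.
    static void writeLong(ByteArrayOutputStream out, long n) {
        long v = zigZag(n);
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    // double values: 8 bytes, little-endian.
    static void writeDouble(ByteArrayOutputStream out, double d) {
        byte[] bytes = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putDouble(d).array();
        out.write(bytes, 0, bytes.length);
    }

    // Fields are written in schema order (user_id, amount, datestamp) with no tags or names.
    static byte[] encodeDeposit(int userId, double amount, long datestamp) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLong(out, userId);
        writeDouble(out, amount);
        writeLong(out, datestamp);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] encoded = encodeDeposit(123, 1000.00, 1L);
        System.out.println("Encoded size in bytes: " + encoded.length);
    }
}
```

Without the writer's schema these bytes cannot be interpreted, which is why Avro stores the schema alongside the data.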