Beruflich Dokumente
Kultur Dokumente
4-1
Data Representation
"PALO ALTO, Calif. Intel says its Pentium Pro and new Pentium II chips have a flaw that can cause computers to sometimes make mistakes but said the problems could be fixed easily with rewritten software" from a Reuters telegram
Overview
This chapter describes a number of modules that can be used to convert between Python objects and other data representations. They are often used to read and write foreign file formats, and to store or transfer Python variables.
Binary data
Python provides several support modules that help you decode and encode binary data formats. The struct module can convert between binary data structures (like C structs) and Python tuples. The array module wraps binary arrays of data (C arrays) into a Python sequence object.
Self-describing formats
To pass data between different Python programs, you can marshal or pickle your data. The marshal module uses a simple self-describing format which supports most built-in data types, including code objects. Python uses this format itself, to store compiled code on disk (in PYC files). The pickle module provides a more sophisticated format, which supports user-defined classes, selfreferencing data structures, and more. This module is available in two versions; the basic pickle module is written in Python, and is relatively slow, while cPickle is written in C, and is usually as fast as marshal.
Python Standard Library Copyright (c) 1999-2003 by Fredrik Lundh. All rights reserved.
4-2
Output formatting
This group of modules supplement built-in formatting functions like repr, and the % string formatting operator. The pprint module can print almost any Python data structure in a nice, readable way (well, as readable as it can make things, that is). The repr module provides a replacement for the built-in function with the same name. The version in this module applies tight limits on most things; it doesn't print more than 30 characters from each string, it doesn't print more than a few levels of deeply nested data structures, etc.
4-3
The array objects can be treated as ordinary lists, to some extent. You cannot concatenate arrays if they have different type codes, though.
Python Standard Library: Data Representation Example: Using arrays as ordinary sequences
# File:array-example-2.py import array a = array.array("B", [1, 2, 3]) a.append(4) a=a+a a = a[2:-2] print a print repr(a.tostring()) for i in a: print i, array('B', [3, 4, 1, 2]) '\003\004\001\002' 3412
4-4
This module also provides a very efficient way to turn raw binary data into a sequence of integers (or floating point values, for that matter): Example: Using arrays to convert strings to lists of integers
# File:array-example-3.py import array a = array.array("i", "fish license") # signed integer print a print repr(a.tostring()) print a.tolist() array('i', [1752394086, 1667853344, 1702063717]) 'fish license' [1752394086, 1667853344, 1702063717]
Python Standard Library: Data Representation Finally, here's how to use this module to determine the endianess of the current platform: Example: Using the array module to determine platform endianess
# File:array-example-4.py import array def little_endian(): return ord(array.array("i",[1]).tostring()[0]) if little_endian(): print "little-endian platform (intel, alpha)" else: print "big-endian platform (motorola, sparc)" big-endian platform (motorola, sparc)
4-5
Python 2.0 and later provides a sys.byteorder attribute, which is set to either "little" or "big": Example: Using the sys.byteorder attribute to determine platform endianess (Python 2.0)
# File:sys-byteorder-example-1.py import sys # available in Python 2.0 and later if sys.byteorder == "little": print "little-endian platform (intel, alpha)" else: print "big-endian platform (motorola, sparc)" 'big-endian platform (motorola, sparc)'
4-6
4-7
The XDR format is used by Sun's remote procedure call (RPC) protocol. Here's an incomplete (and rather contrived) example showing how to build an RPC request package:
Python Standard Library: Data Representation Example: Using the xdrlib module to send a RPC call package
# File:xdrlib-example-2.py import xdrlib # some constants (see the RPC specs for details) RPC_CALL = 1 RPC_VERSION = 2 MY_PROGRAM_ID = 1234 # assigned by Sun MY_VERSION_ID = 1000 MY_TIME_PROCEDURE_ID = 9999 AUTH_NULL = 0 transaction = 1 p = xdrlib.Packer() # send an Sun RPC call package p.pack_uint(transaction) p.pack_enum(RPC_CALL) p.pack_uint(RPC_VERSION) p.pack_uint(MY_PROGRAM_ID) p.pack_uint(MY_VERSION_ID) p.pack_uint(MY_TIME_PROCEDURE_ID) p.pack_enum(AUTH_NULL) p.pack_uint(0) p.pack_enum(AUTH_NULL) p.pack_uint(0) print repr(p.get_buffer()) '\000\000\000\001\000\000\000\001\000\000\000\002\000\000\004\322 \000\000\003\350\000\000\'\017\000\000\000\000\000\000\000\000\000 \000\000\000\000\000\000\000'
4-8
4-9
Python Standard Library: Data Representation The marshal module can also handle code objects (it's used to store precompiled Python modules). Example: Using the marshal module to serialize code
# File:marshal-example-2.py import marshal script = """ print 'hello' """ code = compile(script, "<script>", "exec") data = marshal.dumps(code) # intermediate format print type(data), len(data) print "-"*50 print repr(data) print "-"*50 exec marshal.loads(data) <type 'string'> 81 -------------------------------------------------'c\000\000\000\000\001\000\000\000s\017\000\000\00 0\177\000\000\177\002\000d\000\000GHd\001\000S(\00 2\000\000\000s\005\000\000\000helloN(\000\000\000\ 000(\000\000\000\000s\010\000\000\000<script>s\001 \000\000\000?\002\000s\000\000\000\000' -------------------------------------------------hello
4-10
4-11
4-12
On the other hand, pickle cannot handle code objects (but see the copy_reg module for a way to fix this). By default, pickle uses a text-based format. You can also use a binary format, in which numbers and binary strings are stored in a compact binary format. The binary format usually results in smaller files. Example: Using the pickle module in binary mode
# File:pickle-example-2.py import pickle import math value = ( "this is a long string" * 100, [1.2345678, 2.3456789, 3.4567890] * 100 ) # text mode data = pickle.dumps(value) print type(data), len(data), pickle.loads(data) == value # binary mode data = pickle.dumps(value, 1) print type(data), len(data), pickle.loads(data) == value
4-13
4-14
4-15
We can work around this by registering a code object handler. Such a handler consists of two parts; a pickler which takes the code object and returns a tuple that can only contain simple data types, and an unpickler which takes the contents of such a tuple as its arguments: Example: Using the copy_reg module to enable pickling of code objects
# File:copy-reg-example-2.py import copy_reg import pickle, marshal, types # # register a pickle handler for code objects def code_unpickler(data): return marshal.loads(data) def code_pickler(code): return code_unpickler, (marshal.dumps(code),) copy_reg.pickle(types.CodeType, code_pickler, code_unpickler) # # try it out CODE = """ print "suppose he's got a pointed stick" """ code = compile(CODE, "<string>", "exec") exec code exec pickle.loads(pickle.dumps(code)) suppose he's got a pointed stick suppose he's got a pointed stick
If you're transferring the pickled data across a network, or to another program, the custom unpickler must of course be available at the receiving end as well.
Python Standard Library: Data Representation For the really adventurous, here's a version that allows you to pickle open file objects: Example: Using the copy_reg module to enable pickling of file objects
# File:copy-reg-example-3.py import copy_reg import pickle, types import StringIO # # register a pickle handler for file objects def file_unpickler(position, data): file = StringIO.StringIO(data) file.seek(position) return file def file_pickler(code): position = file.tell() file.seek(0) data = file.read() file.seek(position) return file_unpickler, (position, data) copy_reg.pickle(types.FileType, file_pickler, file_unpickler) # # try it out file = open("samples/sample.txt", "rb") print file.read(120), print "<here>", print pickle.loads(pickle.dumps(file)).read() We will perhaps eventually be writing only small modules which are identified by name as they are used to build larger <here> ones, so that devices like indentation, rather than delimiters, might become feasible for expressing local structure in the source language. -- Donald E. Knuth, December 1974
4-16
4-17
4-18
4-19
In addition, the = character is used for padding at the end of the data stream. The encode and decode functions work on file objects: Example: Using the base64 module to encode files
# File:base64-example-1.py import base64 MESSAGE = "life of brian" file = open("out.txt", "w") file.write(MESSAGE) file.close() base64.encode(open("out.txt"), open("out.b64", "w")) base64.decode(open("out.b64"), open("out.txt", "w")) print "original:", repr(MESSAGE) print "encoded message:", repr(open("out.b64").read()) print "decoded message:", repr(open("out.txt").read()) original: 'life of brian' encoded message: 'bGlmZSBvZiBicmlhbg==\012' decoded message: 'life of brian'
4-20
The encodestring and decodestring functions convert between strings instead. They're currently implemented as wrappers on top of encode and decode, using StringIO objects for input and output. Example: Using the base64 module to encode strings
# File:base64-example-2.py import base64 MESSAGE = "life of brian" data = base64.encodestring(MESSAGE) original_data = base64.decodestring(data) print "original:", repr(MESSAGE) print "encoded data:", repr(data) print "decoded data:", repr(original_data) original: 'life of brian' encoded data: 'bGlmZSBvZiBicmlhbg==\012' decoded data: 'life of brian'
Here's how to convert a user name and a password to a HTTP basic authentication string. Note that you don't really have to work for the NSA to be able to decode this format... Example: Using the base64 module for basic authentication
# File:base64-example-3.py import base64 def getbasic(user, password): # basic authentication (according to HTTP) return base64.encodestring(user + ":" + password) print getbasic("Aladdin", "open sesame") 'QWxhZGRpbjpvcGVuIHNlc2FtZQ=='
Python Standard Library: Data Representation Finally, here's a small utility that converts a GIF image to a Python script, for use with the Tkinter library: Example: Using the base64 module to wrap GIF images for Tkinter
# File:base64-example-4.py import base64, sys if not sys.argv[1:]: print "Usage: gif2tk.py giffile >pyfile" sys.exit(1) data = open(sys.argv[1], "rb").read() if data[:4] != "GIF8": print sys.argv[1], "is not a GIF file" sys.exit(1) print print print print print print print '# generated from', sys.argv[1], 'by gif2tk.py' 'from Tkinter import PhotoImage' 'image = PhotoImage(data="""' base64.encodestring(data), '""")'
4-21
# generated from samples/sample.gif by gif2tk.py from Tkinter import PhotoImage image = PhotoImage(data=""" R0lGODlhoAB4APcAAAAAAIAAAACAAICAAAAAgIAAgACAgICAgAQEBIwEBIyMBJRUlISE/L RUBAQE ... AjmQBFmQBnmQCJmQCrmQDNmQDvmQEBmREnkRAQEAOw== """)
4-22
4-23
As this example shows, non-US characters are mapped to an '=' followed by two hexadecimal digits. So is the '=' character itself ("=3D"), as well as whitespace at the end of lines ("=20"). Everything else looks just like before. So provided you don't use too many weird characters, the encoded string is nearly as readable as the original. (Europeans generally hate this encoding, and strongly believe that certain US programmers deserve to be slapped in the head with a huge great fish to the jolly music of Edward German...)
4-24
The uu module
The UU encoding scheme is used to convert arbitrary binary data to plain text. This format is quite popular on the Usenet, but is slowly being superseded by base64 encoding. An UU encoder takes groups of three bytes (24 bits), and converts each group to a sequence of four printable characters (6 bits per character), using characters from chr(32) (space) to chr(95). Including the length marker and line feed characters, UU encoding typically expands data by 40%. An encoded data stream starts with a begin line, which also includes the file privileges (the Unix mode field, as an octal number) and the filename, and ends with an end line:
begin 666 sample.jpg M_]C_X 02D9)1@ ! 0 0 ! #_VP!# @&!@<&!0@'!P<)'0@*#!0-# L+ ...more lines like this... end
The uu module provides two functions, encode and decode: encode(infile, outfile, filename) encodes data from the input file and writes it to the output file. The input and output file arguments can be either filenames or file objects. The third argument is used as filename in the begin field. Example: Using the uu module to encode a binary file
# File:uu-example-1.py import uu import os, sys infile = "samples/sample.jpg" uu.encode(infile, sys.stdout, os.path.basename(infile)) begin 666 sample.jpg M_]C_X 02D9)1@ ! 0 0 ! #_VP!# @&!@<&!0@'!P<)"0@*#!0-# L+ M#!D2$P\4'1H?'AT:'!P@)"XG("(L(QP<*#<I+# Q-#0T'R<Y/3@R/"XS-#+_ MVP!# 0D)"0P+#!@-#1@R(1PA,C(R,C(R,C(R,C(R,C(R,C(R,C(R,C(R,C(R M,C(R,C(R,C(R,C(R,C(R,C(R,C(R,C+_P 1" " ( # 2( A$! Q$!_\0 M'P 04! 0$! 0$ $" P0%!@<("0H+_\0 M1 @$# P($ P4%
4-25
decode(infile, outfile) decodes uu-encoded data from the input text file, and writes it to the output file. Again, both arguments can be either filenames or file objects. Example: Using the uu module to decode a uu-encoded file
# File:uu-example-2.py import uu import StringIO infile = "samples/sample.uue" outfile = "samples/sample.jpg" # # decode fi = open(infile) fo = StringIO.StringIO() uu.decode(fi, fo) # # compare with original data file data = open(outfile, "rb").read() if fo.getvalue() == data: print len(data), "bytes ok"
4-26