
A Seminar Report On Introduction to Video Compression Techniques And MPEG-4

Undertaken at BMSCE, Bengaluru Submitted in partial fulfillment of the requirement for the award of the degree of

BACHELOR OF ENGINEERING In ELECTRONICS AND COMMUNICATION Submitted by UJWAL KAMATH K USN: 1BM08EC112 Under the guidance of K VIJAYA and ASHWINI V Department of ECE

Department of Electronics and Communication B.M.S.C.E, BULL TEMPLE ROAD, Bangalore-19

CERTIFICATE SEPTEMBER 2011 - JANUARY 2012 AUTONOMOUS Under VISVESVARAYA TECHNOLOGICAL UNIVERSITY BMS College of Engineering Bengaluru

This is to certify that the seminar titled Introduction to Video Compression Techniques And MPEG-4 has been successfully completed by UJWAL KAMATH K (1BM08EC112) at BMSCE, Bengaluru

in partial fulfillment for the award of the degree of Bachelor of Engineering in ELECTRONICS AND COMMUNICATION of Visvesvaraya Technological University during the session September 2011-January 2012. Internal Guides: K VIJAYA, ASHWINI V HOD: Dr Seshachalam D

CONTENTS
1. Introduction
2. The Basics of Compression
3. Standardized Organizations
4. An Overview of Compression Formats
5. Introduction to MPEG-4 Video Compression
6. Scope and features of the MPEG-4 standard
7. Coded representation of media objects
8. Versions in MPEG-4
9. Major Functionalities in MPEG-4
10. MPEG-4 Parts
11. Detailed Technical Description of MPEG-4
12. Structure Of The Tools For Representing Natural Video
13. Support For Conventional And Content-Based Functionalities
14. The MPEG-4 Video Image And Coding Scheme
15. Advantages
16. Current Developments
17. Conclusions
18. Future Options For MPEG-4
19. References

Introduction to Video Compression Techniques And MPEG-4


1.Introduction
When an ordinary analog video sequence is digitized according to the standard CCIR 601, it can consume as much as 165 Mbps, which is 165 million bits every second. Since most surveillance applications have to share the network with other data-intensive applications, this bandwidth is rarely available. To circumvent this problem, a series of techniques called picture and video compression techniques have been derived to reduce this high bit-rate. Compression is the process of encoding information using fewer bits than the original representation would use. Compression is useful because it helps reduce the consumption of expensive resources, such as hard disk space or transmission bandwidth. On the downside, compressed data must be decompressed to be used, and this extra processing may be detrimental to some applications. For instance, a compression scheme for video may require expensive hardware for the video to be decompressed fast enough to be viewed as it is being decompressed (the option of decompressing the video in full before watching it may be inconvenient, and requires storage space for the decompressed video). The design of data compression schemes therefore involves trade-offs among various factors, including the degree of compression, the amount of distortion introduced (if using a lossy compression scheme), and the computational resources required to compress and decompress the data.
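The 165 Mbps figure follows directly from the CCIR 601 sampling parameters. A back-of-the-envelope sketch, assuming 720x576 active pixels per frame, 25 frames per second (PAL) and 16 bits per pixel in 4:2:2 sampling:

```python
# Rough bandwidth estimate for digitized CCIR 601 (PAL) video.
width, height = 720, 576      # active pixels per frame
fps = 25                      # PAL frame rate
bits_per_pixel = 16           # 4:2:2 sampling: 8-bit luma plus shared chroma

bits_per_second = width * height * fps * bits_per_pixel
print(f"{bits_per_second / 1e6:.3f} Mbit/s")  # 165.888 Mbit/s
```

This is why uncompressed digital video is impractical to transmit on ordinary networks, and why the compression techniques below matter.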

2.The Basics of Compression


Compression basically means reducing image data. As mentioned previously, a digitized analog video sequence can comprise up to 165 Mbps of data. To reduce the media overheads for distributing these sequences, the following techniques are commonly employed to achieve desirable reductions in image data:

> Reduce color nuances within the image.
> Reduce the color resolution with respect to the prevailing light intensity.
> Remove small, invisible parts of the picture.
> Compare adjacent images and remove details that are unchanged between two images.

The first three are image-based compression techniques, where only one frame is evaluated and compressed at a time. The last one is a video compression technique, where adjacent frames are compared as a way to further reduce the image data. All of these techniques are based on an accurate understanding of how the human brain and eyes work together to form a complex visual system.
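The second technique, reducing color resolution, is typically realized as chroma subsampling. A minimal numpy sketch of a 4:2:0-style reduction, assuming the chroma plane is given as an array (the function name is illustrative):

```python
import numpy as np

def subsample_chroma_420(chroma):
    """Average each 2x2 block of a chroma plane (4:2:0-style reduction).

    Halves the resolution in both directions, keeping 1/4 of the samples,
    which the eye barely notices because it is less sensitive to color
    detail than to brightness detail.
    """
    h, w = chroma.shape
    blocks = chroma[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.mean(axis=(1, 3))

cb = np.arange(16, dtype=float).reshape(4, 4)
print(subsample_chroma_420(cb).shape)  # (2, 2)
```

The luma (brightness) plane is kept at full resolution; only the two chroma planes are reduced this way.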

Compression can be divided into two categories: lossless compression and lossy compression. Lossless compression is a class of algorithms that allow the exact original data to be reconstructed from the compressed data. This means that only a limited set of techniques is available for data reduction, and the result is a limited reduction of data. When image/video quality is valued above file size, lossless algorithms are typically chosen. GIF is an example of lossless image compression but, because of its limited abilities, is not relevant in video surveillance. Lossy compression algorithms take advantage of the inherent limitations of the human eye and discard invisible information. Most lossy compression algorithms allow for variable quality levels (compression), and as these levels are increased, file size is reduced. At the highest compression levels, video deterioration becomes noticeable as "compression artifacts".

The following are some of the lossless and lossy data compression techniques:

Lossless coding techniques:
> Run-length encoding
> Huffman encoding
> Arithmetic encoding
> Entropy coding
> Area coding
> Lempel-Ziv coding

Lossy coding techniques:
> Predictive coding
> Transform coding (FT/DCT/Wavelets)
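Run-length encoding, the simplest lossless technique listed above, can be sketched as follows (a minimal illustration, not a production codec):

```python
def rle_encode(data):
    """Run-length encode a sequence into (value, count) pairs."""
    runs = []
    for value in data:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([value, 1])   # start a new run
    return [(v, c) for v, c in runs]

def rle_decode(runs):
    """Exactly reconstruct the original sequence -- lossless."""
    return [v for v, c in runs for _ in range(c)]

pixels = [255, 255, 255, 0, 0, 255]
print(rle_encode(pixels))  # [(255, 3), (0, 2), (255, 1)]
```

Because decoding reproduces the input exactly, this is lossless; it only pays off on data with long runs of identical values.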

3.Standardized Organizations
There are two important organizations that develop image and video compression standards: the International Telecommunication Union (ITU) and the International Organization for Standardization (ISO). Formally, ITU is not a standardization organization; it releases its documents as recommendations, for example ITU-R Recommendation BT.601 for digital video. ISO is a formal standardization organization, and it cooperates with the International Electrotechnical Commission (IEC) for standards within areas such as IT. The two are often referred to as a single body, ISO/IEC. The fundamental difference is that ITU stems from the telecommunications world and has chiefly dealt with standards relating to telecommunications, whereas ISO is a general standardization organization and IEC is a standardization organization dealing with electronic and electrical standards. Lately, however, following the ongoing convergence of communications and media, with terms such as triple play being used (meaning Internet, television and telephone services over the same connection), the organizations and their members (one of which is Axis Communications) have experienced increasing overlap in their standardization efforts.

3.1 Two basic standards: JPEG and MPEG

The two basic compression standards are JPEG and MPEG. In broad terms, JPEG is associated with still digital pictures, whilst MPEG is dedicated to digital video sequences. But the traditional JPEG (and JPEG 2000) image formats also come in flavors that are appropriate for digital video: Motion JPEG and Motion JPEG 2000. The group of MPEG standards that includes the MPEG-1, MPEG-2, MPEG-4 and H.264 formats has some similarities as well as some notable differences. One thing they all have in common is that they are International Standards set by ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission), with contributors from the US, Europe and Japan, among others. They are also recommendations proposed by the ITU (International Telecommunication Union), which has further helped to establish them as the globally accepted de facto standards for digital still picture and video coding. Within ITU, the Video Coding Experts Group (VCEG) is the subgroup that has developed, for example, the H.261 and H.263 recommendations for video-conferencing over telephone lines. The foundation of the JPEG and MPEG standards was laid in the mid-1980s when a group called the Joint Photographic Experts Group (JPEG) was formed. With a mission to develop a standard for color picture compression, the group's first public contribution was the release of the first part of the JPEG standard, in 1991. Since then the JPEG group has continued to work on both the original JPEG standard and the JPEG 2000 standard. In the late 1980s the Moving Picture Experts Group (MPEG) was formed with the purpose of deriving a standard for the coding of moving pictures and audio. It has since produced the standards for MPEG-1, MPEG-2 and MPEG-4, as well as standards not concerned with the actual coding of multimedia, such as MPEG-7 and MPEG-21.

4. An Overview of Compression Formats


JPEG

The JPEG standard, ISO/IEC 10918, is the single most widespread picture compression format of today. It offers the flexibility to either select high picture quality with a fairly high compression ratio, or to get a very high compression ratio at the expense of a reasonably lower picture quality. Systems such as cameras and viewers can be made inexpensive due to the low complexity of the technique. The artifacts show up as blockiness, as seen in Figure 1, which appears when the compression ratio is pushed too high. In normal use, a JPEG-compressed picture shows no visual difference to the original uncompressed picture. JPEG image compression comprises a series of advanced techniques. The main one that does the real image compression is the Discrete Cosine Transform (DCT), followed by a quantization that removes the redundant information (the invisible parts).
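The DCT-plus-quantization step can be sketched for a single 8x8 block in numpy. This is a minimal illustration only: real JPEG additionally applies a standard perceptual quantization matrix, zig-zag scanning and entropy coding on top of this.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix (n x n)."""
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2 / n)
    m[0, :] = np.sqrt(1 / n)  # DC row gets the smaller normalization
    return m

D = dct_matrix()
# One 8x8 block of pixel values, level-shifted to be centered on zero.
block = np.random.default_rng(0).integers(0, 256, (8, 8)).astype(float) - 128
coeffs = D @ block @ D.T              # forward 2-D DCT
q = 16                                # uniform quantizer step (illustrative)
quantized = np.round(coeffs / q)      # the lossy step: small coefficients -> 0
restored = D.T @ (quantized * q) @ D  # dequantize and inverse DCT
```

Without the quantization step the transform is perfectly invertible; the data reduction comes entirely from rounding away the small (visually insignificant) coefficients.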

Figure 1. Original image (left) and JPEG compressed picture (right).

Motion JPEG

A digital video sequence can be represented as a series of JPEG pictures. The advantages are the same as with single still JPEG pictures: flexibility both in terms of quality and compression ratio. The main disadvantage of Motion JPEG (a.k.a. MJPEG) is that since it uses only a series of still pictures, it makes no use of video compression techniques. The result is a lower compression ratio for video sequences compared to real video compression techniques like MPEG. The benefit is its robustness, with no dependency between the frames: for example, even if one frame is dropped during transfer, the rest of the video will be unaffected.

JPEG 2000

JPEG 2000 was created as the follow-up to the successful JPEG compression, with better compression ratios. The basis was to incorporate new advances in picture compression research into an international standard. Instead of the DCT transformation, JPEG 2000, ISO/IEC 15444, uses the Wavelet transformation. The advantage of JPEG 2000 is that the blockiness of JPEG is removed, but replaced with a more overall fuzzy picture, as can be seen in Figure 2.

Figure 2. Original image (left) and JPEG 2000 compressed picture (right).

Whether this fuzziness of JPEG 2000 is preferred compared to the blockiness of JPEG is a matter of personal preference. Regardless, JPEG 2000 never took off for surveillance applications and is still not widely supported in web browsers either.

Motion JPEG 2000

As with JPEG and Motion JPEG, JPEG 2000 can also be used to represent a video sequence. The advantages are equal to those of JPEG 2000, i.e., a slightly better compression ratio compared to JPEG, but at the price of complexity. The disadvantage resembles that of Motion JPEG: since it is a still picture compression technique, it does not take any advantage of video sequence compression. This results in a lower compression ratio compared to real video compression techniques. The viewing experience of a video stream in Motion JPEG 2000 is generally considered not as good as a Motion JPEG stream, and Motion JPEG 2000 has never been a success as a video compression technique.

H.261/H.263

H.261 and H.263 are not International Standards but Recommendations of the ITU. They are both based on the same technique as the MPEG standards and can be seen as simplified versions of MPEG video compression. They were originally designed for video-conferencing over telephone lines, i.e. low bandwidth. However, it is somewhat contradictory that they lack some of the more advanced MPEG techniques needed to make really efficient use of bandwidth. The conclusion is therefore that H.261 and H.263 are not suitable for use in general digital video coding.

MPEG-1

The first public standard of the MPEG committee was MPEG-1, ISO/IEC 11172, whose first parts were released in 1993. MPEG-1 video compression is based upon the same technique that is used in JPEG. In addition, it also includes techniques for efficient coding of a video sequence.

Figure 3. A three-picture JPEG video sequence.

Consider the video sequence displayed in Figure 3. The picture to the left is the first picture in the sequence, followed by the picture in the middle and then the picture to the right. When displayed, the video sequence shows a man running from right to left, with a house that stands still. In Motion JPEG/Motion JPEG 2000, each picture in the sequence is coded as a separate unique picture, resulting in the same sequence as the original one.


In MPEG video, only the new parts of the video sequence are included, together with information about the moving parts. The video sequence of Figure 3 will then appear as in Figure 4. But this is only true during the transmission of the video sequence, to limit the bandwidth consumption. When displayed, it appears as the original video sequence again.

Figure 4. A three-picture MPEG video sequence.

MPEG-1 is focused on bit-streams of about 1.5 Mbps and was originally intended for storage of digital video on CDs. The focus is on compression ratio rather than picture quality. It can be considered as traditional VCR quality, but digital. It is important to note that the MPEG-1 standard, as well as the MPEG-2, MPEG-4 and H.264 standards described below, defines the syntax of an encoded video stream together with the method of decoding this bitstream. Thus, only the decoder is actually standardized. An MPEG encoder can be implemented in different ways, and a vendor may choose to implement only a subset of the syntax, provided it produces a bitstream that is compliant with the standard. This allows for optimization of the technology and for reducing complexity in implementations. However, it also means that there are no guarantees for quality: different vendors implement MPEG encoders that produce video streams that differ in quality.
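The idea of coding only the changed parts can be illustrated with plain frame differencing. This is a deliberately simplified sketch: real MPEG uses motion estimation and compensation on top of prediction, not just pixel differences.

```python
import numpy as np

def encode_delta(prev, curr, threshold=4):
    """Keep only the pixels that changed noticeably since the previous frame."""
    diff = curr.astype(int) - prev.astype(int)
    diff[np.abs(diff) < threshold] = 0   # unchanged areas cost almost nothing
    return diff

def decode_delta(prev, diff):
    """Rebuild the current frame from the previous frame plus the delta."""
    return (prev.astype(int) + diff).clip(0, 255).astype(np.uint8)

prev = np.zeros((4, 4), dtype=np.uint8)   # e.g. the static house background
curr = prev.copy()
curr[1, 1] = 200                          # the only "moving" pixel
diff = encode_delta(prev, curr)
print(np.count_nonzero(diff))             # 1 -- only the change is coded
```

Like MPEG inter-frames, this makes each frame depend on the previous one, which is exactly the robustness trade-off against Motion JPEG described above.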

MPEG-2

The MPEG-2 project focused on extending the compression technique of MPEG-1 to cover larger pictures and higher quality, at the expense of higher bandwidth usage. MPEG-2, ISO/IEC 13818, also provides more advanced techniques to enhance the video quality at the same bit-rate. The expense is the need for far more complex equipment. As a note, DVD movies are compressed using the techniques of MPEG-2.

MPEG-3

The next version of the MPEG standard, MPEG-3, was designed to handle HDTV. However, it was discovered that the MPEG-2 standard could be slightly modified to achieve the same results as the planned MPEG-3 standard. Consequently, the work on MPEG-3 was discontinued.

MPEG-4

The next generation of MPEG, MPEG-4, is based upon the same technique as MPEG-1 and MPEG-2. Once again, the new standard focused on new applications. The most important new features of MPEG-4, ISO/IEC 14496, concerning video compression are the support of even lower-bandwidth applications, e.g. mobile devices like cell phones, and, on the other hand, applications with extremely high quality and almost unlimited bandwidth. In general, the MPEG-4 standard is a lot wider in scope than the previous standards. It also allows for any frame rate, while MPEG-2 was locked to 25 frames per second in PAL and 30 frames per second in NTSC. When MPEG-4 is mentioned in surveillance applications today, it is usually MPEG-4 Part 2 that is referred to. This is the classic MPEG-4 video streaming standard, a.k.a. MPEG-4 Visual. Some network video streaming systems specify support for MPEG-4 short header, which is an H.263 video stream encapsulated with MPEG-4 video stream headers. MPEG-4 short header does not take advantage of any of the additional tools specified in the MPEG-4 standard, which gives a lower quality level than both MPEG-2 and MPEG-4 at a given bit-rate.

H.264

H.264 is the latest-generation standard for video encoding. This initiative has many goals. It should provide good video quality at substantially lower bit rates than previous standards, with better error robustness, or better video quality at an unchanged bit rate. The standard is further designed to give lower latency, as well as better quality for higher latency. In addition, all these improvements over previous standards were to come without increasing the complexity of design so much that it would be impractical or expensive to build applications and systems. An additional goal was to provide enough flexibility to allow the standard to be applied to a wide variety of applications: for both low and high bit rates, for low- and high-resolution video, and with high and low demands on latency. Indeed, a number of applications with different requirements have been identified for H.264:

> Entertainment video, including broadcast, satellite, cable, DVD, etc. (1-10 Mbps, high latency)
> Telecom services (<1 Mbps, low latency)
> Streaming services (low bit-rate, high latency)
> And others


As a note, DVD players for high-definition DVD formats such as HD-DVD and Blu-ray support movies encoded with H.264.

MPEG-7 MPEG-7 is a different kind of standard as it is a multimedia content description standard, and does not deal with the actual encoding of moving pictures and audio. With MPEG-7, the content of the video (or any other multimedia) is described and associated with the content itself, for example to allow fast and efficient searching in the material. MPEG-7 uses XML to store metadata, and it can be attached to a time code in order to tag particular events in a stream. Although MPEG-7 is independent of the actual encoding technique of the multimedia, the representation that is defined within MPEG-4, i.e. the representation of audio-visual data in terms of objects, is very well suited to the MPEG-7 standard. MPEG-7 is relevant for video surveillance since it could be used for example to tag the contents and events of video streams for more intelligent processing in video management software or video analytics applications.
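The idea of attaching XML metadata to a time code can be sketched as follows. Note that the element names here are simplified placeholders for illustration only; the real MPEG-7 standard defines its own Description Schemes and Description Definition Language.

```python
import xml.etree.ElementTree as ET

# Illustrative only: a minimal event tag of the kind a surveillance
# system might attach to a stream. Not the normative MPEG-7 schema.
event = ET.Element("VideoEvent")
ET.SubElement(event, "TimeCode").text = "00:12:37:05"
ET.SubElement(event, "Label").text = "person enters scene"
print(ET.tostring(event, encoding="unicode"))
```

Because the description lives beside the encoded video rather than inside it, search and analytics tools can process such tags without ever decoding the stream itself.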

MPEG-21 MPEG-21 is a standard that defines means of sharing digital rights, permissions, and restrictions for digital content. MPEG-21 is an XML-based standard, and is developed to counter illegitimate distribution of digital content. MPEG-21 is not particularly relevant for video surveillance situations.

Note: Of all the above compression formats, MPEG-4 is a fairly complex and comprehensive standard that is widely used nowadays.

5. Introduction to MPEG-4 Video Compression


MPEG-4 is an ISO/IEC standard developed by MPEG (Moving Picture Experts Group), the committee that also developed the Emmy Award winning standards known as MPEG-1 and MPEG-2. These standards made interactive video on CD-ROM, DVD and Digital Television possible. MPEG-4 is the result of another international effort involving hundreds of researchers and engineers from all over the world. MPEG-4, with the formal ISO/IEC designation 'ISO/IEC 14496', was finalized in October 1998 and became an International Standard in the first months of 1999. The fully backward compatible extensions under the title of MPEG-4 Version 2 were frozen at the end of 1999, to acquire formal International Standard status early in 2000. Several extensions have been added since, and work on some specific work items is still in progress.

MPEG-4 builds on the proven success of three fields:

> Digital television
> Interactive graphics applications (synthetic content)
> Interactive multimedia (World Wide Web, distribution of and access to content)

6. Scope and features of the MPEG-4 standard


The MPEG-4 standard provides a set of technologies to satisfy the needs of authors, service providers and end users alike.

For authors, MPEG-4 enables the production of content that has far greater reusability and flexibility than is possible today with individual technologies such as digital television, animated graphics, World Wide Web (WWW) pages and their extensions. Also, it is now possible to better manage and protect content owner rights.

For network service providers, MPEG-4 offers transparent information, which can be interpreted and translated into the appropriate native signaling messages of each network with the help of relevant standards bodies. The foregoing, however, excludes Quality of Service considerations, for which MPEG-4 provides a generic QoS descriptor for the different MPEG-4 media. The exact translations from the QoS parameters set for each medium to the network QoS are beyond the scope of MPEG-4 and are left to network providers. Signaling of the MPEG-4 media QoS descriptors end-to-end enables transport optimization in heterogeneous networks.

For end users, MPEG-4 brings higher levels of interaction with content, within the limits set by the author. It also brings multimedia to new networks, including those employing relatively low bit rates, and mobile ones. An MPEG-4 applications document exists on the MPEG home page (mpeg.chiariglione.org), which describes many end-user applications, including interactive multimedia broadcast and mobile communications. For all parties involved, MPEG seeks to avoid a multitude of proprietary, non-interworking formats and players.

7. Coded representation of media objects


MPEG-4 audiovisual scenes are composed of several media objects, organized in a hierarchical fashion. At the leaves of the hierarchy, we find primitive media objects, such as:

> Still images (e.g. as a fixed background);
> Video objects (e.g. a talking person, without the background);
> Audio objects (e.g. the voice associated with that person, background music).

MPEG-4 standardizes a number of such primitive media objects, capable of representing both natural and synthetic content types, which can be either 2- or 3-dimensional. In addition to the media objects mentioned above and shown in Figure 1, MPEG-4 defines the coded representation of objects such as:

> Text and graphics;
> Talking synthetic heads and associated text, used to synthesize the speech and animate the head;
> Animated bodies to go with the faces;
> Synthetic sound.

A media object in its coded form consists of descriptive elements that allow handling the object in an audiovisual scene, as well as of associated streaming data, if needed. It is important to note that in its coded form, each media object can be represented independently of its surroundings or background. The coded representation of media objects is as efficient as possible while taking into account the desired functionalities. Examples of such functionalities are error robustness, easy extraction and editing of an object, or having an object available in a scalable form.


Figure 1 gives an example that highlights the way in which an audiovisual scene in MPEG-4 is composed of individual objects. The figure contains compound AVOs that group elementary AVOs together. As an example, the visual object corresponding to the talking person and the corresponding voice are tied together to form a new compound AVO, containing both the aural and visual components of that talking person. Such grouping allows authors to construct complex scenes, and enables consumers to manipulate meaningful (sets of) objects. The MPEG-4 systems layer facilitates signaling the use of different tools, and thus codecs according to existing standards can be accommodated. MPEG-4 therefore allows the use of several highly optimized coders, such as those standardized by the ITU-T, which were designed to meet a specific set of requirements. Each of the coders is designed to operate in a stand-alone mode with its own bitstream syntax. Additional functionalities are realized both within individual coders and by means of additional tools around the coders. An example of a functionality within an individual coder is pitch change within the parametric coder.


8. Versions in MPEG-4
MPEG-4 Version 1 was approved by MPEG in December 1998; Version 2 was frozen in December 1999. After these two major versions, more tools were added in subsequent amendments that could be qualified as versions, even though they are harder to recognize as such. Recognizing the versions is not too important, however; it is more important to distinguish Profiles. Existing tools and profiles from any version are never replaced in subsequent versions; technology is always added to MPEG-4 in the form of new profiles. Figure 2 below depicts the relationship between the versions. Version 2 is a backward compatible extension of Version 1, Version 3 is a backward compatible extension of Version 2, and so on. The versions of all major parts of the MPEG-4 Standard (Systems, Audio, Video, DMIF) were synchronized; after that, the different parts took their own paths.

Figure 2 - relation between MPEG-4 Versions

The Systems layer of later versions is backward compatible with all earlier versions. In the areas of Systems, Audio and Visual, new versions add Profiles and do not change existing ones. In fact, it is very important to note that existing systems will always remain compliant, because Profiles will never be changed in retrospect, and neither will the Systems syntax, at least not in a backward-incompatible way.

9. Major Functionalities in MPEG-4


This section lists, in an itemized fashion, the major functionalities that the different parts of the MPEG-4 Standard offer in the finalized MPEG-4 Version 1. Descriptions of the functionalities can be found in the following sections.


9.1 Transport

In principle, MPEG-4 does not define transport layers. In a number of cases, adaptation to a specific existing transport layer has been defined:

> Transport over MPEG-2 Transport Stream (this is an amendment to MPEG-2 Systems)
> Transport over IP (in cooperation with IETF, the Internet Engineering Task Force)

9.2 DMIF

DMIF, the Delivery Multimedia Integration Framework, is an interface between the application and the transport that allows the MPEG-4 application developer to stop worrying about the transport. A single application can run on different transport layers when supported by the right DMIF instantiation. MPEG-4 DMIF supports the following functionalities:

> A transparent MPEG-4 DMIF-application interface, irrespective of whether the peer is a remote interactive peer, broadcast or local storage media
> Control of the establishment of FlexMux channels
> Use of homogeneous networks between interactive peers: IP, ATM, mobile, PSTN, Narrowband ISDN
> Support for mobile networks, developed together with ITU-T
> User commands with acknowledgment messages
> Management of MPEG-4 Sync Layer information

9.3 Systems

As explained above, MPEG-4 defines a toolbox of advanced compression algorithms for audio and visual information. The data streams (Elementary Streams, ES) that result from the coding process can be transmitted or stored separately, and need to be composed so as to create the actual multimedia presentation at the receiver side. The Systems part of MPEG-4 addresses the description of the relationship between the audio-visual components that constitute a scene. The relationship is described at two main levels.

The Binary Format for Scenes (BIFS) describes the spatio-temporal arrangement of the objects in the scene. Viewers may have the possibility of interacting with the objects, e.g. by rearranging them in the scene or by changing their own point of view in a 3D virtual environment. The scene description provides a rich set of nodes for 2-D and 3-D composition operators and graphics primitives. At a lower level, Object Descriptors (ODs) define the relationship between the Elementary Streams pertinent to each object (e.g. the audio and the video streams of a participant in a videoconference). ODs also provide additional information such as the
URL needed to access the Elementary Streams, the characteristics of the decoders needed to parse them, intellectual property information, and others. Other issues addressed by MPEG-4 Systems:

> A standard file format supports the exchange and authoring of MPEG-4 content
> Interactivity, including: client- and server-based interaction; a general event model for triggering events or routing user actions; general event handling and routing between objects in the scene, upon user- or scene-triggered events
> Java (MPEG-J) can be used to query the terminal and its environment, and there is also a Java application engine to code 'MPEGlets'
> A tool for interleaving of multiple streams into a single stream, including timing information (FlexMux tool)
> A tool for storing MPEG-4 data in a file (the MPEG-4 File Format, MP4)
> Interfaces to various aspects of the terminal and networks, in the form of Java APIs (MPEG-J)
> Transport layer independence; mappings to relevant transport protocol stacks, like RTP/UDP/IP or MPEG-2 Transport Stream, can be or are being defined jointly with the responsible standardization bodies
> Text representation with international language support, font and font style selection, timing and synchronization
> The initialization and continuous management of the receiving terminal's buffers
> Timing identification, synchronization and recovery mechanisms
> Datasets covering identification of Intellectual Property Rights relating to media objects

9.4 Audio

MPEG-4 Audio facilitates a wide variety of applications, ranging from intelligible speech to high quality multichannel audio, and from natural sounds to synthesized sounds. In particular, it supports the highly efficient representation of audio objects consisting of the following.

9.4.1 General Audio Signals

Support for coding general audio, ranging from very low bitrates up to high quality, is provided by transform coding techniques. With this functionality, a wide range of bitrates and bandwidths is covered. It starts at a bitrate of 6 kbit/s and a bandwidth below 4 kHz, and extends to broadcast quality audio, from mono up to multichannel. High quality can be achieved with low delays. Parametric Audio Coding allows sound manipulation at low bitrates. Fine Granularity Scalability (FGS) provides scalability resolution down to 1 kbit/s per channel.

9.4.2 Speech Signals

Speech coding can be done using bitrates from 2 kbit/s up to 24 kbit/s with the speech coding tools. Lower bitrates, such as an average of 1.2 kbit/s, are also possible when variable rate
coding is allowed. Low delay is possible for communications applications. When using the HVXC tools, speed and pitch can be modified under user control during playback. If the CELP tools are used, a change of the playback speed can be achieved by using an additional tool for effects processing.

9.4.3 Synthetic Audio

MPEG-4 Structured Audio is a language to describe 'instruments' (little programs that generate sound) and 'scores' (input that drives those objects). These objects are not necessarily musical instruments; they are in essence mathematical formulae that could generate the sound of a piano, that of falling water, or something 'unheard' in nature.

9.4.4 Synthesized Speech

Scalable TTS coders, with a bit rate range from 200 bit/s to 1.2 kbit/s, allow a text, or a text with prosodic parameters (pitch contour, phoneme duration, and so on), as input to generate intelligible synthetic speech.

9.5 Visual

The MPEG-4 Visual standard allows the hybrid coding of natural (pixel-based) images and video together with synthetic (computer generated) scenes. This enables, for example, the virtual presence of videoconferencing participants. To this end, the Visual standard comprises tools and algorithms supporting the coding of natural (pixel-based) still images and video sequences, as well as tools to support the compression of synthetic 2-D and 3-D graphic geometry parameters (i.e. compression of wire grid parameters, synthetic text). The subsections below give an itemized overview of the functionalities that the tools and algorithms of the MPEG-4 Visual standard provide.

9.5.1 Formats Supported

The following formats and bitrates are supported by MPEG-4 Visual:

- Bitrates: typically between 5 kbit/s and more than 1 Gbit/s
- Formats: progressive as well as interlaced video
- Resolutions: typically from sub-QCIF to 'Studio' resolutions (4k x 4k pixels)

9.5.2 Compression Efficiency

For all bit rates addressed, the algorithms are very efficient. This includes:
- Compact coding of textures, with a quality adjustable between "acceptable" for very high compression ratios up to "near lossless"
- Efficient compression of textures for texture mapping on 2-D and 3-D meshes
- Random access of video, to allow functionalities such as pause, fast forward and fast reverse of stored video

9.5.3 Content-Based Functionalities


- Content-based coding of images and video allows separate decoding and reconstruction of arbitrarily shaped video objects.
- Random access of content in video sequences allows functionalities such as pause, fast forward and fast reverse of stored video objects.
- Extended manipulation of content in video sequences allows functionalities such as warping of synthetic or natural text, textures, image and video overlays on reconstructed video content. An example is the mapping of text in front of a moving video object, where the text moves coherently with the object.
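The overlays and arbitrarily shaped objects described above ultimately rely on alpha compositing: each decoded video object carries a shape (alpha) map saying which pixels belong to it, and the compositor blends it over the scene. A minimal sketch with hypothetical toy arrays (not MPEG-4 decoder output):

```python
import numpy as np

def composite(background, obj, alpha):
    """Blend an arbitrarily shaped object over a background.

    background, obj : float arrays of shape (H, W) (one colour plane)
    alpha           : float array in [0, 1]; 0 = transparent, 1 = opaque.
                      A binary alpha map uses only 0 and 1; a gray-scale
                      map allows feathered edges.
    """
    return alpha * obj + (1.0 - alpha) * background

# Toy 4x4 example: a 2x2 opaque object pasted into the top-left corner.
bg = np.zeros((4, 4))
obj = np.full((4, 4), 10.0)
alpha = np.zeros((4, 4))
alpha[:2, :2] = 1.0          # binary shape: pixel belongs to the object

out = composite(bg, obj, alpha)
```

With a gray-scale alpha map, intermediate values (e.g. 0.5 at the object boundary) produce the feathered blending mentioned in 9.5.5.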

9.5.4 Scalability of Textures, Images and Video


- Complexity scalability in the encoder allows encoders of different complexity to generate valid and meaningful bitstreams for a given texture, image or video.
- Complexity scalability in the decoder allows a given texture, image or video bitstream to be decoded by decoders of different levels of complexity. The reconstructed quality, in general, is related to the complexity of the decoder used. This may entail that less powerful decoders decode only a part of the bitstream.
- Spatial scalability allows decoders to decode a subset of the total bitstream generated by the encoder to reconstruct and display textures, images and video objects at reduced spatial resolution. A maximum of 11 levels of spatial scalability are supported in so-called 'fine-granularity scalability', for video as well as textures and still images.
- Temporal scalability allows decoders to decode a subset of the total bitstream generated by the encoder to reconstruct and display video at reduced temporal resolution. A maximum of three levels are supported.
- Quality scalability allows a bitstream to be parsed into a number of bitstream layers of different bitrates, such that the combination of a subset of the layers can still be decoded into a meaningful signal. The bitstream parsing can occur either during transmission or in the decoder. The reconstructed quality, in general, is related to the number of layers used for decoding and reconstruction.
- Fine Grain Scalability: a combination of the above in fine-grain steps, up to 11 steps.
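Quality scalability, as described above, can be pictured as a base layer plus residual enhancement layers: decoding more layers restores more of the signal. A simplified numpy sketch, where uniform quantisation stands in for the real coding tools:

```python
import numpy as np

def encode_layers(signal, steps):
    """Split a signal into a base layer and enhancement residuals.
    `steps` lists quantiser step sizes, coarsest first."""
    layers, residual = [], signal.astype(float)
    for q in steps:
        layer = np.round(residual / q) * q   # coarsely coded portion
        layers.append(layer)
        residual = residual - layer          # what the next layer refines
    return layers

def decode_layers(layers, n):
    """Reconstruct from the first n layers only."""
    return sum(layers[:n])

x = np.array([3.2, -1.7, 8.9, 0.4])
layers = encode_layers(x, steps=[4.0, 1.0, 0.25])

coarse = decode_layers(layers, 1)   # base layer only: lowest quality
full = decode_layers(layers, 3)     # all layers: best quality
```

Dropping enhancement layers (during transmission or in the decoder) degrades quality gracefully instead of breaking the stream.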

9.5.5 Shape and Alpha Channel Coding

Shape coding assists the description and composition of conventional images and video as well as arbitrarily shaped video objects. There is an efficient technique to code binary shapes: a binary alpha map defines whether or not a pixel belongs to an object; it can be on or off. Applications that benefit from binary shape maps with images are content-based image representations for image databases, interactive games, surveillance, and animation.

Gray-scale (alpha) shape coding: an alpha plane defines the transparency of an object, which is not necessarily uniform; it can vary over the object so that, e.g., edges are more transparent (a technique called feathering). Multilevel alpha maps are frequently used to blend different layers of image sequences. Other applications that benefit from associated binary alpha maps with images are content-based image representations for image databases, interactive games, surveillance, and animation.

9.5.6 Robustness in Error Prone Environments

Error resilience allows accessing image and video over a wide range of storage and transmission media. This includes the useful operation of image and video compression algorithms in error-prone environments at low bit-rates (i.e., less than 64 kbit/s). There are tools that address both the band-limited nature and the error resiliency aspects of access over wireless networks.

9.5.7 Face and Body Animation

The Face and Body Animation tools in the standard allow sending parameters that can define, calibrate and animate synthetic faces and bodies. The models themselves are not standardized by MPEG-4, only the parameters are, although there is a way to send, e.g., a well-defined face to a decoder. The tools include:

Definition and coding of face and body animation parameters (model independent):
- Feature point positions and orientations to animate the face and body definition meshes
- Visemes, or visual lip configurations equivalent to speech phonemes

Definition and coding of face and body definition parameters (for model calibration):
- 3-D feature point positions
- 3-D head calibration meshes for animation
- Personal characteristics
- Facial texture coding

9.5.8 Coding of 2-D Meshes with Implicit Structure

2D mesh coding includes:

- Mesh-based prediction and animated texture transfiguration
- 2-D Delaunay or regular mesh formalism with motion tracking of animated objects
- Motion prediction and suspended texture transmission with dynamic meshes
- Geometry compression for motion vectors: 2-D mesh compression with implicit structure and decoder reconstruction


9.5.9 Coding of 3-D Polygonal Meshes

MPEG-4 provides a suite of tools for coding 3-D polygonal meshes. Polygonal meshes are widely used as a generic representation of 3-D objects. The underlying technologies compress the connectivity, geometry, and properties such as shading normals, colors and texture coordinates of 3-D polygonal meshes. The Animation Framework eXtension (AFX) will provide more elaborate tools for 2D and 3D synthetic objects.
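Geometry compression of this kind typically begins by quantising vertex coordinates to a fixed number of bits within the mesh's bounding box; the connectivity and property coding are considerably more involved. The sketch below shows only that first quantisation idea and is not the MPEG-4 3D mesh coding algorithm:

```python
import numpy as np

def quantize_vertices(vertices, bits):
    """Map float vertex coordinates to integer codes of `bits` precision
    inside the bounding box; returns codes plus the info to dequantise."""
    lo = vertices.min(axis=0)
    hi = vertices.max(axis=0)
    scale = (2 ** bits - 1) / np.where(hi > lo, hi - lo, 1.0)
    codes = np.round((vertices - lo) * scale).astype(np.int32)
    return codes, lo, scale

def dequantize_vertices(codes, lo, scale):
    """Recover approximate float coordinates from the integer codes."""
    return codes / scale + lo

verts = np.array([[0.0, 0.0, 0.0],
                  [1.0, 2.0, 0.5],
                  [0.3, 1.1, 0.9]])
codes, lo, scale = quantize_vertices(verts, bits=10)
approx = dequantize_vertices(codes, lo, scale)
```

At 10 bits per coordinate, each vertex costs 30 bits instead of 96 (three 32-bit floats), with a reconstruction error bounded by half a quantiser step per axis.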

10. MPEG-4 Parts


Some of the main MPEG-4 parts and their functions are listed below.

MPEG-4 Part 1
Number: ISO/IEC 14496-1
First Edition: 1999; Last Edition: 2010; Latest Amendment: 2010
Title: Systems
Description: Describes synchronization and multiplexing of video and audio, for example the MPEG-4 file format version 1 (obsoleted by version 2, defined in MPEG-4 Part 14). The functionality of a transport protocol stack for transmitting and/or storing content complying with ISO/IEC 14496 is not within the scope of 14496-1; only the interface to this layer is considered (DMIF). Transport of MPEG-4 content is defined elsewhere, e.g. in the MPEG-2 Transport Stream, the RTP Audio Video Profiles, and others.

MPEG-4 Part 2
Number: ISO/IEC 14496-2
First Edition: 1999; Last Edition: 2004; Latest Amendment: 2009
Title: Visual
Description: A compression codec for visual data (video, still textures, synthetic images, etc.). One of the many "profiles" in Part 2 is the Advanced Simple Profile (ASP).

MPEG-4 Part 3
Number: ISO/IEC 14496-3
First Edition: 1999; Last Edition: 2009; Latest Amendment: 2010
Title: Audio
Description: A set of compression codecs for perceptual coding of audio signals, including some variations of Advanced Audio Coding (AAC) as well as other audio/speech coding formats and tools, such as Audio Lossless Coding (ALS), Scalable Lossless Coding (SLS), Structured Audio, the Text-To-Speech Interface (TTSI), HVXC, CELP and others.

MPEG-4 Part 6
Number: ISO/IEC 14496-6
First Edition: 1999; Last Edition: 2000
Title: Delivery Multimedia Integration Framework (DMIF)

MPEG-4 Part 10
Number: ISO/IEC 14496-10
First Edition: 2003; Last Edition: 2009; Latest Amendment: 2010
Title: Advanced Video Coding (AVC)
Description: A codec for video signals which is technically identical to the ITU-T H.264 standard.

MPEG-4 Part 14
Number: ISO/IEC 14496-14
First Edition: 2003; Last Edition: 2003; Latest Amendment: 2010
Title: MP4 file format
Description: Also known as "MPEG-4 file format version 2". The designated container file format for MPEG-4 content, based on Part 12. It revises and completely replaces Clause 13 of ISO/IEC 14496-1 (MPEG-4 Part 1: Systems), in which the MPEG-4 file format was previously specified.

MPEG-4 Part 17
Number: ISO/IEC 14496-17
First Edition: 2006; Last Edition: 2006
Title: Streaming text format
Description: Timed Text subtitle format.


MPEG-4 Part 20
Number: ISO/IEC 14496-20
First Edition: 2006; Last Edition: 2008; Latest Amendment: 2009
Title: Lightweight Application Scene Representation (LASeR) and Simple Aggregation Format
Description: The LASeR requirements (compression efficiency, code and memory footprint) are fulfilled by building upon the existing Scalable Vector Graphics (SVG) format defined by the World Wide Web Consortium.

MPEG-4 Part 22
Number: ISO/IEC 14496-22
First Edition: 2007; Last Edition: 2009
Title: Open Font Format
Description: Based on the OpenType version 1.4 font format specification, and technically equivalent to that specification. Reached "CD" stage in July 2005; published as an ISO standard in 2007.

11. Detailed Technical Description of MPEG-4


11.1 SYNTHESIZED SOUND

Decoders are also available for generating sound based on structured inputs. Text input is converted to speech in the Text-To-Speech (TTS) decoder, while more general sounds, including music, may be normatively synthesized. Synthetic music may be delivered at extremely low bitrates while still describing an exact sound signal.

Text To Speech: TTS allows a text, or a text with prosodic parameters (pitch contour, phoneme duration, and so on), as its input to generate intelligible synthetic speech. It includes the following functionalities:
- Speech synthesis using the prosody of the original speech
- Facial animation control with phoneme information
- Trick mode functionality: pause, resume, jump forward/backward
- International language support for text
- International symbol support for phonemes
- Support for specifying age, gender, language and dialect of the speaker


11.2 SYNTHETIC OBJECTS

Synthetic objects form a subset of the larger class of computer graphics; as an initial focus, the following visual synthetic objects will be described:
- Parametric descriptions of (a) a synthetic description of the human face and body and (b) animation streams of the face and body
- Static and dynamic mesh coding with texture mapping
- Texture coding for view-dependent applications

11.3 FACIAL ANIMATION

The face is an object capable of facial geometry ready for rendering and animation. The shape, texture and expressions of the face are generally controlled by the bitstream containing instances of Facial Definition Parameter (FDP) sets and/or Facial Animation Parameter (FAP) sets. Upon construction, the Face object contains a generic face with a neutral expression. This face can already be rendered. It is also immediately capable of receiving FAPs from the bitstream, which will produce animation of the face: expressions, speech, etc. If FDPs are received, they are used to transform the generic face into a particular face determined by its shape and (optionally) texture. Optionally, a complete face model can be downloaded via the FDP set, as a scene graph for insertion in the face node.

The Face object can also receive local controls that can be used to modify the look or behavior of the face locally, by a program or by the user. There are three possibilities of local control. First, by locally sending a set of FDPs to the Face, the shape and/or texture can be changed. Second, a set of Amplification Factors can be defined, each factor corresponding to an animation parameter in the FAP set. The Face object will apply these Amplification Factors to the FAPs, resulting in amplification or attenuation of selected facial actions. This feature can be used, for example, to amplify the visual effect of speech pronunciation for easier lip reading. The third local control is allowed through the definition of the Filter Function. This function, if defined, will be invoked by the Face object immediately before each rendering. The Face object passes the original FAP set to the Filter Function, which applies any modification to it and returns it to be used for the rendering. The Filter Function can include user interaction. It is also possible to use the Filter Function as a source of facial animation if there is no bitstream to control the face, e.g. in the case where the face is driven by a TTS system that in turn is driven uniquely by text coming through the bitstream.

11.4 BODY ANIMATION

The Body is an object capable of producing virtual body models and animations in the form of a set of 3D polygon meshes ready for rendering. Two sets of parameters are defined for the body: the Body Definition Parameter (BDP) set and the Body Animation Parameter (BAP) set. The BDP set defines the parameters that transform the default body into a customized body with its body surface, body dimensions, and (optionally) texture. The Body Animation Parameters (BAPs), if correctly interpreted, will produce reasonably similar high-level results in terms of body posture and animation on different body models, without the need to initialize or calibrate the model.

Upon construction, the Body object contains a generic virtual human body with the default posture. This body can already be rendered. It is also immediately capable of receiving BAPs from the bitstream, which will produce animation of the body. If BDPs are received, they are used to transform the generic body into a particular body determined by the parameter contents. Any component can be null; a null component is replaced by the corresponding default component when the body is rendered. The default posture is a standing posture, defined as follows: the feet point to the front, and the two arms are placed at the sides of the body with the palms of the hands facing inward. This posture also implies that all BAPs have default values. No assumption is made and no limitation is imposed on the range of motion of joints. In other words, the human body model should be capable of supporting various applications, from realistic simulation of human motions to network games using simple human-like models.
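The local controls of 11.3 are easy to picture: the terminal scales each incoming animation parameter by a per-parameter amplification factor and may pass the result through a user-supplied filter function just before rendering. A sketch with made-up parameter values (a real FAP set has 68 parameters; these three are purely illustrative):

```python
def apply_local_controls(faps, amplification, filter_fn=None):
    """Amplify or attenuate each animation parameter, then let an optional
    filter function modify the set immediately before rendering."""
    out = [a * f for a, f in zip(amplification, faps)]
    if filter_fn is not None:
        out = filter_fn(out)
    return out

# Exaggerate the first (say, lip-related) parameter for easier lip reading,
# attenuate the second, leave the third unchanged.
faps = [0.2, 0.5, 1.0]
amp = [2.0, 0.5, 1.0]
rendered = apply_local_controls(faps, amp)

# A filter function could clamp values or inject user interaction:
clamped = apply_local_controls(faps, amp,
                               filter_fn=lambda fs: [min(f, 0.3) for f in fs])
```

The filter function hook mirrors the third local control: it sees the parameter set last, right before each rendering pass.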

11.5 2-D ANIMATED MESHES

A 2D mesh is a tessellation (or partition) of a 2D planar region into polygonal patches. The vertices of the polygonal patches are referred to as the node points of the mesh. MPEG-4 considers only triangular meshes, where the patches are triangles. A 2D dynamic mesh refers to 2D mesh geometry and the motion information of all mesh node points within a temporal segment of interest. Triangular meshes have long been used for efficient 3D object shape (geometry) modeling and rendering in computer graphics; 2D mesh modeling may be considered as the projection of such 3D triangular meshes onto the image plane.

A dynamic mesh is a forward-tracking mesh, where the node points of the initial mesh track image features forward in time by their respective motion vectors. The initial mesh may be regular, or it can be adapted to the image content, which is called a content-based mesh. 2D content-based mesh modeling then corresponds to non-uniform sampling of the motion field at a number of salient feature points (node points) along the contour and interior of a video object. Methods for selection and tracking of these node points are not subject to standardization.

In 2D mesh-based texture mapping, triangular patches in the current frame are deformed by the movements of the node points into triangular patches in the reference frame, and the texture inside each patch in the reference frame is warped onto the current frame using a parametric mapping, defined as a function of the node point motion vectors. For triangular meshes, the affine mapping is a common choice. The attractiveness of 2D mesh modeling originates from the fact that 2D meshes can be designed from a single view of an object without requiring range data, while maintaining several of the functionalities offered by 3D mesh modeling.
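The affine mapping mentioned above is fully determined by the three node-point correspondences of a triangular patch: three points give six equations for the six affine coefficients. A numpy sketch with illustrative coordinates:

```python
import numpy as np

def affine_from_triangle(src, dst):
    """Solve for A (2x2) and t (2,) with dst = A @ src + t, given three
    corresponding node points (rows of src and dst)."""
    rows, rhs = [], []
    for (x, y), (xp, yp) in zip(src, dst):
        # Each correspondence (x, y) -> (x', y') contributes two equations.
        rows.append([x, y, 0, 0, 1, 0]); rhs.append(xp)
        rows.append([0, 0, x, y, 0, 1]); rhs.append(yp)
    a, b, c, d, tx, ty = np.linalg.solve(np.array(rows, float),
                                         np.array(rhs, float))
    return np.array([[a, b], [c, d]]), np.array([tx, ty])

# Reference-frame triangle and its tracked node positions in the current
# frame: here the motion vectors describe a pure translation by (2, 1).
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
dst = src + np.array([2.0, 1.0])

A, t = affine_from_triangle(src, dst)

def warp(p):
    """Map a texture coordinate from the reference patch to the current frame."""
    return A @ np.asarray(p, float) + t
```

Every pixel inside the patch is then warped with the same `A` and `t`, which is what makes mesh-based texture mapping cheap once the node motion vectors are decoded.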


Video Object Manipulation
- Augmented reality: merging virtual (computer generated) images with real moving images (video) to create enhanced display information. The computer generated images must remain in perfect registration with the moving real images (hence the need for tracking).
- Synthetic-object transfiguration/animation: replacing a natural video object in a video clip by another video object. The replacement video object may be extracted from another natural video clip or may be transfigured from a still image object using the motion information of the object to be replaced (hence the need for a temporally continuous motion representation).
- Spatio-temporal interpolation: mesh motion modeling provides more robust motion-compensated temporal interpolation (frame rate up-conversion).

Video Object Compression
2D mesh modeling may be used for compression if one chooses to transmit texture maps only at selected key frames and to animate these texture maps (without sending any prediction error image) for the intermediate frames. This is also known as self-transfiguration of selected key frames using 2D mesh information.

12. STRUCTURE OF THE TOOLS FOR REPRESENTING NATURAL VIDEO


The MPEG-4 image and video coding algorithms give an efficient representation of visual objects of arbitrary shape, with the goal of supporting so-called content-based functionalities. In addition, they support most functionalities already provided by MPEG-1 and MPEG-2, including the efficient compression of standard rectangular-sized image sequences at varying levels of input formats, frame rates, pixel depth and bit-rates, and at various levels of spatial, temporal and quality scalability. A basic classification of the bit-rates and functionalities currently provided by the MPEG-4 Visual standard for natural images and video is depicted in Figure A below, which attempts to cluster bit-rate levels versus sets of functionalities.


Figure A - Classification of the MPEG-4 Image and Video Coding Algorithms and Tools

13. SUPPORT FOR CONVENTIONAL AND CONTENT-BASED FUNCTIONALITIES


The MPEG-4 Video standard will support the decoding of conventional rectangular images and video as well as the decoding of images and video of arbitrary shape. This concept is illustrated in Figure B below.

Figure B - the VLBV Core and the Generic MPEG-4 Coder

The coding of conventional images and video is achieved similarly to conventional MPEG-1/2 coding and involves motion prediction/compensation followed by texture coding. For the content-based functionalities, where the image sequence input may be of arbitrary shape and location, this approach is extended by also coding shape and transparency information. Shape may be represented either by an 8-bit transparency component - which allows the description of transparency if one VO is composed with other objects - or by a binary mask.

14. THE MPEG-4 VIDEO IMAGE AND CODING SCHEME


Figure C below outlines the basic approach of the MPEG-4 video algorithms to encode rectangular as well as arbitrarily shaped input image sequences. The basic coding structure involves shape coding (for arbitrarily shaped VOs) and motion compensation as well as DCT-based texture coding (using standard 8x8 DCT or shape adaptive DCT).

Figure C - Basic block diagram of MPEG-4 Video Coder
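The DCT-based texture coding path can be sketched for a single 8x8 block: forward DCT-II, uniform quantisation, then the inverse at the decoder. This is a deliberate simplification of the standard's transform and quantiser (no zig-zag scan, no entropy coding, a single flat quantiser step):

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix C, so that coeffs = C @ block @ C.T."""
    k = np.arange(n)
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C *= np.sqrt(2.0 / n)
    C[0, :] = np.sqrt(1.0 / n)
    return C

C = dct_matrix()

def encode_block(block, q=16):
    """Forward 8x8 DCT followed by uniform quantisation."""
    return np.round((C @ block @ C.T) / q).astype(int)

def decode_block(coeffs, q=16):
    """Dequantise and apply the inverse DCT."""
    return C.T @ (coeffs * q) @ C

block = np.outer(np.arange(8), np.ones(8)) * 16.0   # smooth 8x8 ramp
coeffs = encode_block(block)
recon = decode_block(coeffs)
```

For smooth texture like the ramp above, the energy compacts into a handful of low-frequency coefficients, which is exactly why the subsequent entropy coding is effective.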

15. ADVANTAGES
An important advantage of the content-based coding approach taken by MPEG-4 is that the compression efficiency can be significantly improved for some video sequences by using appropriate, dedicated object-based motion prediction "tools" for each object in a scene. A number of motion prediction techniques can be used to allow efficient coding and flexible presentation of the objects:

How objects are grouped together: An MPEG-4 scene follows a hierarchical structure, which can be represented as a directed acyclic graph. Each node of the graph is an AV object, as illustrated in Figure 12 (note that this tree refers back to Figure 1). The tree structure is not necessarily static; node attributes (e.g., positioning parameters) can be changed, while nodes can be added, replaced, or removed.

How objects are positioned in space and time: In the MPEG-4 model, audiovisual objects have both a spatial and a temporal extent. Each AV object has a local coordinate system, one in which the object has a fixed spatio-temporal location and scale. The local coordinate system serves as a handle for manipulating the AV object in space and time. AV objects are positioned in a scene by specifying a coordinate transformation from the object's local coordinate system into a global coordinate system defined by one or more parent scene description nodes in the tree.
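Positioning an object then amounts to composing transforms down the tree: the object's local coordinates are mapped through each parent node's transform into the global scene. A 2-D sketch with homogeneous matrices (a hypothetical scene, not BIFS syntax):

```python
import numpy as np

def translate(tx, ty):
    """Homogeneous 2-D translation matrix."""
    return np.array([[1.0, 0.0, tx],
                     [0.0, 1.0, ty],
                     [0.0, 0.0, 1.0]])

def scale(s):
    """Homogeneous 2-D uniform scaling matrix."""
    return np.array([[s, 0.0, 0.0],
                     [0.0, s, 0.0],
                     [0.0, 0.0, 1.0]])

def to_global(point, transforms):
    """Map a local point through the chain of parent-node transforms,
    listed root first (as encountered walking down the scene tree)."""
    p = np.array([point[0], point[1], 1.0])
    m = np.eye(3)
    for t in transforms:
        m = m @ t
    return (m @ p)[:2]

# A sprite at local (1, 0), inside a group scaled by 2,
# inside a scene node translated by (10, 5):
g = to_global((1.0, 0.0), [translate(10.0, 5.0), scale(2.0)])
```

Because each node only stores its own transform, moving a parent node moves every descendant object for free, which is what makes the scene-graph "handle" convenient for manipulation.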

15.1 USER INTERACTION

MPEG-4 allows for user interaction with the presented content. This interaction can be separated into two major categories: client-side interaction and server-side interaction. Client-side interaction involves content manipulation which is handled locally at the end-user's terminal, and can take several forms. In particular, the modification of an attribute of a scene description node, e.g., changing the position of an object, making it visible or invisible, changing the font size of a synthetic text node, etc., can be implemented by translating user events (e.g., mouse clicks or keyboard commands) to scene description updates. The commands can be processed by the MPEG-4 terminal in exactly the same way as if they originated from the original content source. As a result, this type of interaction does not require standardization.

15.2 ENHANCEMENTS IN VISUAL DATA CODING

The MPEG-4 Visual standard will allow the hybrid coding of natural (pixel based) images and video together with synthetic (computer generated) scenes. This will, for example, allow the virtual presence of videoconferencing participants. To this end, the Visual standard will comprise tools and algorithms supporting the coding of natural (pixel based) still images and video sequences, as well as tools to support the compression of synthetic 2-D and 3-D graphic geometry parameters (i.e. compression of wire grid parameters, synthetic text). The subsections below give an itemized overview of the functionalities that the tools and algorithms of the MPEG-4 Visual standard will support.

15.3 FORMATS SUPPORTED

The following formats and bitrates will be supported:
- Bitrates: typically between 5 kbit/s and 4 Mbit/s
- Formats: progressive as well as interlaced video
- Resolutions: typically from sub-QCIF to TV

15.4 COMPRESSION EFFICIENCY

Efficient compression of video will be supported for all bit rates addressed. This includes:
- Compact coding of textures, with a quality adjustable between "acceptable" for very high compression ratios up to "near lossless"
- Efficient compression of textures for texture mapping on 2-D and 3-D meshes
- Random access of video, to allow functionalities such as pause, fast forward and fast reverse of stored video

15.5 CORE VIDEO OBJECT PROFILE

The next profile under consideration, with the working name Core, includes the following tools:
- All the tools of the Simple profile
- Bi-directional prediction mode (B)
- H.263/MPEG-2 quantization tables
- Overlapped Block Motion Compensation
- Unrestricted Motion Vectors
- Four Motion Vectors per Macroblock
- Static Sprites
- Temporal scalability (frame-based and object-based)
- Spatial scalability (frame-based)
- Tools for coding of interlaced video

16. Current Developments


- IPMP Extension
- Multi User Worlds
- Advanced Video Coding
- Audio Extensions
- Animation Framework eXtension

Animation Framework eXtension: The Animation Framework eXtension (AFX, pronounced 'effects') provides an integrated toolbox for building attractive and powerful synthetic MPEG-4 environments. The framework defines a collection of interoperable tool categories that collaborate to produce a reusable architecture for interactive animated content. In the context of AFX, a tool represents functionality such as a BIFS node, a synthetic stream, or an audio-visual stream. AFX utilizes and enhances existing MPEG-4 tools, while keeping backward compatibility, by offering:

- Higher-level descriptions of animations (e.g. inverse kinematics)
- Enhanced rendering (e.g. multi-texturing, procedural texturing)
- Compact representations (e.g. piecewise curve interpolators, subdivision surfaces)
- Low bitrate animations (e.g. using interpolator compression and dead-reckoning)
- Scalability based on terminal capabilities (e.g. parametric surface tessellation)
- Interactivity at user level, scene level, and client-server session level
- Compression of representations for static and dynamic tools

Compression of animated paths and animated models is required for improving the transmission and storage efficiency of representations for dynamic and static tools. AFX defines a hierarchy made of 6 categories of models that rely on each other. Each model may have many tools. For example, BIFS tools prior to this specification belong to the lowest category of models of this pyramid. The 6 categories are:


1. Geometric models. Geometric models capture the form and appearance of an object. Many characters in animations and games can be quite efficiently controlled at this low level; familiar tools for generating motion include key framing and motion capture. Due to the predictable nature of motion, building higher-level models for characters that are controlled at the geometric level is generally much simpler.

2. Modeling models. These are an extension of geometric models, adding linear and non-linear deformations to them. They capture transformations of a model without changing its original shape. Animations can be made by changing the deformation parameters independently of the geometric model.

3. Physical models. These capture additional aspects of the world, such as an object's mass and inertia, and how it responds to forces such as gravity. The use of physical models allows many motions to be created automatically and with unparalleled realism. The cost of simulating the equations of motion may be important in a real-time engine, and in many games a physically plausible approach is often preferred. Applications such as collision restitution, deformable bodies, and rigid articulated bodies use these models intensively.

4. Biomechanical models. Real animals have muscles that they use to exert forces and torques on their own bodies. If we have already built physical models of characters, they can use virtual muscles to move themselves around. In his pioneering work on artificial intelligence for games and animation, John Funge refers to a character's ability to exert some control over its motions as actuated (or animate) objects. These models have their roots in control theory.

5. Behavioral models. After simple locomotion comes a character's behavior. A character may exhibit a reactive behavior when its behavior is based solely on its perception of the current situation (i.e., no memory of previous situations). Reactive behaviors can be implemented using stimulus-response rules, which are widely used in games. Finite-State Machines (FSMs) are often used to encode deterministic behaviors based on multiple states. Goal-directed behaviors can be used to define a cognitive character's goals. They can also be used to model flocking behaviors.

6. Cognitive models. If the character is able to learn from stimuli from the world, it may be able to adapt its behavior. These models are related to artificial-intelligence techniques.


The models are hierarchical; each level relies on the next-lower one. For example, an autonomous agent (category 5) may respond to stimuli from the environment it is in and decide to adapt its way of walking (category 4), which can modify the physics equations (for example, skin modeled with mass-spring-damper properties), influence some underlying deformable models (category 2), or even modify the geometry (category 1). If the agent is clever enough, it may also learn from the stimuli (category 6) and adapt or modify its behavioral models.
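A reactive, FSM-style behavioral model (category 5) can be pictured as a table of stimulus-response rules over a small set of states. The states and stimuli below are purely illustrative, not AFX tools:

```python
# State machine: (state, stimulus) -> next state.
TRANSITIONS = {
    ("idle", "sees_player"): "chase",
    ("chase", "player_hidden"): "search",
    ("search", "sees_player"): "chase",
    ("search", "timeout"): "idle",
    ("chase", "caught_player"): "idle",
}

def step(state, stimulus):
    """Reactive behavior: the next state depends only on the current
    situation, with no memory of earlier ones; unknown stimuli are ignored."""
    return TRANSITIONS.get((state, stimulus), state)

state = "idle"
for stimulus in ["sees_player", "player_hidden", "timeout"]:
    state = step(state, stimulus)
```

Because the rule table is deterministic, the same stimulus sequence always produces the same behavior, which is exactly the property FSMs are chosen for in games.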

17. CONCLUSIONS
17.1 MPEG Comparison

Looking at MPEG-2 and later standards, it is important to bear in mind that they are not backwards compatible, i.e. strict MPEG-2 decoders/encoders will not work with MPEG-1. Neither will H.264 encoders/decoders work with MPEG-2 or previous versions of MPEG-4, unless specifically designed to handle multiple formats. However, various solutions are available where streams encoded with newer standards can sometimes be packetized inside older standardization formats to work with older distribution systems.

Since both MPEG-2 and MPEG-4 cover a wide range of picture sizes, picture rates and bandwidth usage, MPEG-2 introduced a concept called Profile@Level. This was created to make it possible to communicate compatibilities among applications. For example, the Studio profile of MPEG-4 is not suitable for a PDA, and vice versa.

17.2 Conclusion: Still pictures

For single still pictures, both JPEG and JPEG 2000 offer good flexibility in terms of picture quality and compression ratio. While JPEG 2000 compresses slightly better than JPEG, especially at very high compression ratios, the advantage is small compared with the price to pay for the extra complexity, making it a less preferred choice today. Overall, the advantages of JPEG in terms of inexpensive equipment, both for coding and viewing, make it the preferred option for still picture compression.

17.3 Conclusion: Motion pictures

Since the H.261/H.263 recommendations are neither international standards nor offer any compression enhancements compared to MPEG, they are of no real interest and are not recommended as suitable techniques for video surveillance.

Due to its simplicity, the widely used Motion JPEG, a standard in many systems, is often a good choice. There is limited delay between image capture in the camera, encoding, transferring over the network, decoding, and finally display at the viewing station. In other words, Motion JPEG provides low latency due to its simplicity (image compression of complete individual images), and is therefore also suitable for image processing, such as video motion detection or object tracking. Any practical image resolution, from mobile phone display size (QVGA) up to full video (4CIF) image size and above (megapixel), is available in Motion JPEG. However, Motion JPEG generates a relatively large volume of image data to be sent across the network.

In comparison, all MPEG standards have the advantage of sending a lower volume of data per time unit across the network (bit-rate) compared to Motion JPEG, except at low frame rates. At low frame rates, where the MPEG compression cannot make much use of similarities between neighboring frames, and due to the overhead generated by the MPEG streaming format, the bandwidth consumption for MPEG is similar to Motion JPEG.

MPEG-1 is thus in most cases more effective than Motion JPEG. However, for just a slightly higher cost, MPEG-2 provides even more advantages and supports better image quality in terms of frame rate and resolution. On the other hand, MPEG-2 requires more network bandwidth and is a technique of greater complexity. MPEG-4 was developed to offer a compression technique for applications demanding lower image quality and bandwidth. It is also able to deliver video compression similar to MPEG-1 and MPEG-2, i.e. higher image quality at higher bandwidth consumption. If the available network bandwidth is limited, or if video is to be recorded at a high frame rate and there are storage space restraints, MPEG may be the preferred option. It provides a relatively high image quality at a lower bit-rate (bandwidth usage).
Still, the lower bandwidth demands come at the cost of higher complexity in encoding and decoding, which in turn contributes to a higher latency when compared to Motion JPEG.

Looking ahead, it is not a bold prediction that H.264 will be a key technique for compression of motion pictures in many application areas, including video surveillance. As mentioned above, it has already been implemented in areas as diverse as high-definition DVD (HD-DVD and Blu-ray), digital video broadcasting including high-definition TV, the 3GPP standard for third-generation mobile telephony, and software such as QuickTime and Apple Computer's Mac OS X operating system. H.264 is now a widely adopted standard, and represents the first time that the ITU, ISO and IEC have come together on a common, international standard for video compression.

H.264 entails significant improvements in coding efficiency, latency, complexity and robustness. It provides new possibilities for creating better video encoders and decoders that provide higher quality video streams at a maintained bit-rate (compared to previous standards), or, conversely, the same quality video at a lower bit-rate. There will always be a market need for better image quality, higher frame rates and higher resolutions with minimized bandwidth consumption. H.264 offers this, and as the H.264 format becomes more broadly available in network cameras, video encoders and video management software, system designers and integrators will need to make sure that the products and vendors they choose support this new open standard. For the time being, network video products that support several compression formats are ideal for maximum flexibility and integration possibilities.
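The bandwidth trade-off discussed above is easy to quantify with back-of-the-envelope numbers. The frame size and the MPEG-4 figure below are illustrative assumptions, not measurements:

```python
def mjpeg_bitrate(frame_kbytes, fps):
    """Motion JPEG sends every frame independently, so the bit-rate is
    simply frame size times frame rate.  Returns Mbit/s."""
    return frame_kbytes * 8 * fps / 1000.0

# Assume roughly 25 kB per JPEG frame at 4CIF resolution:
full_rate = mjpeg_bitrate(25.0, 25)   # 25 fps surveillance stream
low_rate = mjpeg_bitrate(25.0, 2)     # 2 fps stream

# An MPEG-4 stream exploiting inter-frame similarity might manage the
# same scene at around 1 Mbit/s at 25 fps (assumed figure).
mpeg4_rate = 1.0
```

At 25 fps the Motion JPEG stream needs about 5 Mbit/s against roughly 1 Mbit/s for inter-frame coding, while at 2 fps Motion JPEG drops to about 0.4 Mbit/s - illustrating why the MPEG advantage shrinks or disappears at low frame rates.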

18. FUTURE OPTIONS FOR MPEG-4


MPEG-4 is still being developed, and all new parts will work with the old formats:
- Studio quality versions for HDTV
- Digital cinema: 45-240 Mbit/s H.264
- Home video cameras with MPEG-4 output, straight to the web from the hard drive
- Integrated Services Digital Broadcasting (ISDB): newspaper + TV + data
- Integration with MPEG-7 databases
- Games with 3D texture mapping
- TeleVision Modelling Language (TVML): computer-generated TV programs + presenters - Max Headroom??
- Information booths
- Talking objects - fridge, cars, toaster?
- Security cameras over the web
- Interactive manuals and training materials
- New downloadable interactive music format, SAOL

19. REFERENCES

www.google.com
www.wikipedia.org
mpeg.chiariglione.org
www.axis.com/files
