
Evaluating DSP Processor Performance

Berkeley Design Technology, Inc.

© 1997-2002 Berkeley Design Technology, Inc.

Introduction

Digital signal processing (DSP) is the application of mathematical operations to digitally represented signals. Because digital signals can be processed with cost-effective digital integrated circuits, DSP systems can economically accomplish complex tasks, such as speech synthesis and recognition, that would be difficult or impossible to accomplish using conventional analog techniques.

The market for products using DSP technology, such as wireless communication devices and digital audio appliances, is growing rapidly. Semiconductor manufacturers have responded to this demand by producing an expanding array of DSP processors—microprocessors designed specifically for digital signal processing. Selecting the right DSP processor for an application is a difficult and time-consuming task for DSP system designers. This paper presents a methodology for evaluating DSP processor performance, one of the key considerations in choosing a processor.

DSP Processors

Strictly speaking, the term "DSP processor" applies to any microprocessor that operates on digitally represented signals. In practice, however, the term refers to microprocessors specifically designed to perform digital signal processing. Because most signal processing systems perform complicated mathematical operations on real-time signals, DSP processors use special architectures to accelerate repetitive, numerically intensive calculations. For example, DSP architectures commonly include circuitry to rapidly perform multiply-accumulate operations, which are useful in many signal processing algorithms such as filtering. Also, DSP processors often contain multiple-access memory architectures that allow the processor to simultaneously load multiple operands, such as a data sample and a filter coefficient, in parallel with loading an instruction. In addition, DSP processors often include a variety of special memory addressing modes and program-flow control features designed to accelerate the execution of repetitive operations. Lastly, most DSP processors include specialized on-chip peripherals or I/O interfaces that allow the processor to efficiently interface with other system components, such as analog-to-digital converters and host processors.

This paper focuses on the performance of programmable DSP processors. The performance evaluation methodology we describe can also be applied to measure the DSP performance of general-purpose processors. In particular, engineers may be interested in evaluating the DSP performance of general-purpose processors that have been designed to provide some support for DSP. Many manufacturers of desktop general-purpose processors have enhanced the signal processing capabilities of their processors by adding instructions and hardware accelerators. For example, Intel has developed the MMX and SSE extensions for the Pentium processor line. And in the embedded systems arena, many microcontroller and embedded processor manufacturers have also added DSP functionality to their processors. An example is Hitachi's SH-DSP, which combines microcontroller and DSP features.

What is DSP Processor Performance?

DSP processor performance can be measured in many ways. The most common metric is the time required for a processor to accomplish a defined task, but memory usage and energy consumption may be equally—or even more—important in some applications. This paper will examine all three of these metrics, with a focus on execution time.

Measuring DSP processor performance in a way that allows fair comparisons between processor families is difficult. Furthermore, performance measurements are only useful to the typical engineer if they can be related to the requirements of particular applications. To address these challenges, Berkeley Design Technology, Inc. (BDTI) uses a two-fold methodology of algorithm kernel benchmarking and application profiling.

Traditional Approaches to Performance Measurement

MIPS, MOPS, and MACS

Traditional approaches to performance measurement often use very simple metrics to describe processor performance. The most common performance unit, MIPS (millions of instructions per second), is misleading because of the varying amounts of work performed by instructions—a typical instruction on one processor may accomplish far more work than a typical instruction on another processor. This is especially true of DSP processors, which often have highly specialized instruction sets. Without some gauge of the relative efficiency of different instruction sets, MIPS figures are only useful within the context of a single known processor architecture. Similarly, MOPS (millions of operations per second) suffers from a related problem: what counts as an operation and the number of operations needed to accomplish useful work vary greatly from processor to processor.

Other commonly quoted performance measurement units can also be misleading. Because multiply-accumulate operations are central to many DSP algorithms, such as FIR filtering, correlation, and dot products, some processor manufacturers quote performance in MACS (multiply-accumulates per second). However, DSP applications involve many operations other than multiply-accumulates, so MACS alone are not a reliable predictor of performance. Furthermore, most DSP processors can perform other operations in parallel with MAC operations. A processor's ability to perform such parallel operations can have a large impact on inner-loop performance, but this is disregarded by MACS measurements.

Neither MIPS, MOPS, nor MACS addresses secondary performance issues like memory usage and power consumption. This is a severe limitation because execution time means little if an application's memory requirements exceed system constraints. Furthermore, if high memory consumption requires using slower external memory, then the processor's speed may be reduced. Likewise, in a portable application, a processor is unusable if its power consumption exceeds the available battery capacity. Many manufacturers quote a "typical" power consumption at a given clock rate. However, power consumption varies with different instructions and data values, so such specifications are suspect without details on the precise instructions and data used in the measurement. Furthermore, such measurements do not account for special power-saving modes available when a processor (or portions of it) is idle.

It should be noted that energy consumption, which determines battery life, is usually more important to designers than power consumption. A DSP processor that can execute its work quickly and then enter a power-saving mode may consume less energy in a particular application than another DSP processor with lower power consumption. BDTI's processor evaluations report energy consumption.

Application Benchmarking

A common approach used to benchmark computer systems is to use complete applications, or even suites of applications. Application benchmarking allows the memory usage and energy consumption of a processor to be measured, and it is better suited to comparisons between different processor families than MIPS, MOPS, or MACS.

This approach is used by the Standard Performance Evaluation Corporation in the popular SPEC benchmarks for general-purpose processors and systems. In DSP, examples of applications include speech coders (CELP, VSELP, GSM, etc.), modems (V.34, V.90, etc.), and disk drive servo control programs. The approach works best in cases where there is application software portability—i.e., when the application is coded in a high-level language like C. Benchmarking using applications written in a high-level language, however, amounts to benchmarking the compiler as well as the processor. Unfortunately, because of the poor efficiency of compilers for the most cost-effective DSP processors and the demanding performance requirements of the applications, the performance-critical portions of DSP software are typically coded in assembly language. Thus a benchmarking methodology that measures the compiler and the processor together does not reflect the needs of typical DSP system designers.

Even if application benchmarks are coded in assembly language, one encounters four problems with application benchmarking for DSP. First, most DSP applications are not sufficiently well-defined to permit fair comparisons. For instance, two implementations of a standard modem may carry out arithmetic with different numerical precisions, depending on whether the objective is achieving the lowest possible error rate or minimizing demands on the processor. Second, with most complex applications, it is virtually impossible to ensure that software is optimal, or even near-optimal; application implementations may therefore be benchmarking the programmer as much as the processor. Third, full application benchmarks tend to measure a system's performance, not just the processor's, and isolating the performance of a DSP processor from that of other system components like external memory and microcontroller coprocessors can be very difficult. Last, coding an entire application for multiple processors could take years of engineering time, making it an impractical approach for benchmarking.

Algorithm Kernel Benchmarking

Berkeley Design Technology's methodology of algorithm kernel benchmarking and application profiling is a practical compromise between oversimplified MIPS-type metrics and overly complicated application-based benchmarks. Algorithm kernels are the building blocks of most signal processing systems and include functions such as fast Fourier transforms, vector additions, filters, etc.
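To make the notion of a kernel concrete, one of the kernels benchmarked below—the real block FIR filter—can be written as a small reference model. The actual benchmarks are hand-optimized in each processor's assembly language against BDTI's specification (not reproduced here); this Python sketch only pins down the computation a conforming implementation performs, and the function and parameter names are ours, chosen for illustration:

```python
def block_fir(x, h, state):
    """Real block FIR filter: y[n] = sum over k of h[k] * x[n - k].

    x     -- block of new real input samples
    h     -- filter coefficients (taps)
    state -- the last len(h) - 1 input samples from the previous block
    Returns (output block, updated state for the next block).
    """
    buf = state + x                       # history followed by new samples
    y = [sum(h[k] * buf[n - k] for k in range(len(h)))
         for n in range(len(state), len(buf))]
    # Carry the most recent len(h) - 1 samples forward as the new state.
    return y, buf[len(buf) - (len(h) - 1):]
```

A block-oriented formulation like this is exactly where the multiply-accumulate circuitry and multiple-access memories described earlier pay off: the inner sum is one MAC per tap, with a sample and a coefficient fetched in parallel.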

Algorithm kernels offer several compelling advantages as benchmarks:

• Relevance. Algorithm kernels can be selected by examining DSP applications and focusing on those portions of the applications that account for the largest share of the processing time. This guarantees their relevance.

• Ease of specification. By virtue of their modest size, algorithm kernels can be well-defined: a specification can state their input and output requirements, include test vectors to verify functional conformance, and indicate which algorithm variants and optimizations are allowable. For example, there are many techniques for implementing an FFT. Without specifying the exact type of FFT, one cannot fairly compare two processors' FFT execution times.

• Optimization. Because algorithm kernels are of a moderate size, a skilled programmer can write the code in assembly language and be fairly certain that his or her implementation is optimal, or very close to optimal, on a given processor.

• Ease of implementation. Due to their moderate size, algorithm kernels can be implemented in a reasonable amount of time, even with thorough optimization.

The BDTI Benchmarks™, the basic suite of algorithm kernels used in BDTI's DSP processor benchmarking, are shown in Table 1. BDTI calculates execution time, memory usage, and energy consumption for each benchmark. Most of the benchmarks involve transforming an input data set into an output data set. The exception is the Control benchmark, in which the processor must execute a contrived sequence of operations, such as conditional branching and subroutine calls, that are commonly needed in control code. As DSP applications become more complex and system designers try to achieve higher levels of system integration, DSP processors will increasingly be called upon to perform control functions.

With the exception of the Control benchmark, BDTI optimizes each benchmark for execution time. The Control benchmark is optimized for memory usage, since memory usage is usually a greater concern than speed for control code.

Function               | Description                                                                  | Example Applications
Real Block FIR         | Finite impulse response filter that operates on a block of real (not complex) data. | Speech processing (e.g., G.728 speech coding).
Complex Block FIR      | FIR filter that operates on a block of complex data.                         | Modem channel equalization.
Real Single-Sample FIR | FIR filter that operates on a single sample of real data.                    | Speech processing, general filtering.
LMS Adaptive FIR       | Least-mean-square adaptive filter; operates on a single sample of real data. | Channel equalization, servo control, linear predictive coding.
IIR                    | Infinite impulse response filter that operates on a single sample of real data. | Audio processing, general filtering.
Vector Dot Product     | Sum of the pointwise multiplication of two vectors.                          | Convolution, correlation, matrix multiplication, multi-dimensional signal processing.
Vector Add             | Pointwise addition of two vectors, producing a third vector.                 | Graphics, combining audio signals or images.
Vector Maximum         | Find the value and location of the maximum value in a vector.                | Error control coding, algorithms using block floating-point.
Viterbi Decoder        | Decode a block of bits that has been convolutionally encoded.                | Error control coding.
Control                | A sequence of control operations (test, branch, push, pop, and bit manipulation). | Virtually all DSP applications include some control code.
256-Point In-Place FFT | Fast Fourier Transform; converts a time-domain signal to the frequency domain. | Radar, sonar, MPEG audio compression, spectral analysis.
Bit Unpack             | Unpacks variable-length data from a bit stream.                              | Audio decompression, protocol handling.

TABLE 1. BDTI Benchmarks.

Measuring Algorithm Kernel Execution

There are several ways to measure a processor's performance on an algorithm kernel benchmark. Cycle-accurate software simulators usually provide the most convenient method for determining cycle counts.
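A cycle count by itself is not yet a benchmark result; it must be combined with the clock rate—and, for the rough energy estimate described later in this section, with a typical power figure—to yield the metrics reported. The conversions are simple arithmetic; the helper names below are ours, for illustration only:

```python
def execution_time_us(cycles, clock_mhz):
    """Benchmark execution time in microseconds.

    clock_mhz is cycles per microsecond, so the division is direct."""
    return cycles / clock_mhz

def energy_estimate_uj(cycles, clock_mhz, typical_power_mw):
    """Rough energy estimate in microjoules: typical power x execution time.

    mW x us = nJ, so divide by 1000 to express the result in uJ."""
    return typical_power_mw * execution_time_us(cycles, clock_mhz) / 1000.0
```

For example, a kernel measured at 3000 cycles on a 300 MHz device takes 10 microseconds; at an assumed typical power of 500 mW, the estimated energy per run is 5 microjoules.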

A cycle-accurate simulator models a processor's execution of instructions and keeps accurate cycle counts by making appropriate adjustments when factors such as pipeline interlocking or bus contention slow the operation of the processor. Software simulators offer a controlled, flexible, and interactive environment for testing and optimizing code. Some software simulators include support for macros or scripts that can automate performance measurement and functionality verification, allowing engineers to quickly see how code changes affect a processor's performance.

Hardware-based application development tools can also be used to measure execution time, and are needed to precisely gauge energy consumption. Hardware tools, such as emulators, allow the user to download code from a PC to the target processor. Using a debugger, most hardware emulators allow the processor to step through the code line by line, or to run the code until a breakpoint is reached.

Code can be run in continuous loops on development boards to measure energy consumption. Energy consumption is measured by isolating the power going to the DSP processor from the power going to other system components, running a benchmark in a repeating loop, and using a current probe to record the time-varying input current under carefully controlled conditions. Such energy consumption measurements can be time-consuming and difficult. A less accurate but easier alternative is to obtain a credible estimate of typical power consumption and multiply it by the time taken to execute a benchmark. This is the approach taken by BDTI.

Determining benchmark performance for new processors without software or hardware development tools is a tedious and error-prone process. One must manually calculate the time required to execute each instruction in the benchmark and be careful to check that the benchmarks are functionally correct. Because pipeline interlocks or bus conflicts can slow execution time, the processor architecture must be thoroughly understood before instruction execution times are calculated.

Benchmark Results

Figure 1 shows the execution time results of several processors on the BDTI fast Fourier transform (FFT) benchmark. The FFT is a computationally efficient algorithm for computing the discrete Fourier transform, which converts time-domain signals into their frequency-domain representations. The results illustrate how architectural features affect a processor's performance, yielding benchmark results that are not what might be expected from a simple comparison of MIPS ratings.

[Figure 1: bar chart of FFT execution times for fixed-point processors (DSP56311, MSC8101, TMS320C5416, TMS320C6203) and floating-point processors (TMS320C6701, Pentium III, Pentium III-C).]
FIGURE 1. Execution times for a 256-point complex FFT, in microseconds (lower is better). Note: Times are calculated for the fastest version of each processor projected to be available in June 2000. For the processors with on-chip cache, "-C" indicates performance with cache pre-loaded.

Texas Instruments' TMS320C6203 is a VLIW-based processor that can issue and execute up to eight instructions per instruction cycle. Hence, at the 300 MHz clock rate shown here, it has a rating of 2400 MIPS. However, its relative speed compared to another Texas Instruments DSP processor, the architecturally conventional TMS320C5416, is not nearly as high as the difference between the two processors' MIPS ratings suggests. Despite a MIPS ratio of 15:1, the TMS320C6203 executes the FFT benchmark only 7.8 times faster than the TMS320C5416. A major reason is that TMS320C6203 instructions are simpler than TMS320C5416 instructions, so the TMS320C6203 requires more instructions to accomplish the same task. In addition, the TMS320C6203 is not always able to make use of all of its available parallelism because of limitations such as data dependencies and pipeline effects.

This example illustrates why using MIPS ratings to compare the performance of different processors may be misleading, and why BDTI believes that algorithm kernel-based benchmarking provides much more meaningful results with which to compare processor performance.

Of course, one must be cautious when interpreting benchmark results. For example, a processor's data word width affects memory usage as well as numerical accuracy. The benchmark results for a finite impulse response filter implemented on a 24-bit processor might show 50% more data memory usage than the same filter implemented on a 16-bit processor.
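The arithmetic behind that 50% figure is simply that data memory scales with the word width. A sketch (the 256-value buffer size is an arbitrary choice of ours, not a BDTI parameter):

```python
def data_memory_bits(num_values, word_bits):
    """Data memory needed to hold num_values samples or coefficients,
    assuming one native word per value."""
    return num_values * word_bits

# The same filter state on 24-bit vs. 16-bit processors:
ratio = data_memory_bits(256, 24) / data_memory_bits(256, 16)  # 1.5, i.e., 50% more
```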

This increased memory usage is a result of the extended precision of the 24-bit data. In fact, since the 24-bit processor is calculating the filter result to 50% greater precision, the 24-bit processor is in a sense performing more work—a fact not reflected in the benchmark results. If the application needs the additional precision, the 24-bit processor may be an excellent choice. On the other hand, if 16-bit precision is sufficient, then the 24-bit processor may be a poor choice because it consumes more data memory.

Application Profiling

The results of algorithm kernel benchmarks are useful but incomplete without an understanding of how the kernels are used in actual applications. "Application profiling," which refers to a set of techniques used to measure or estimate the amount of time, memory, or other resources that an application spends executing its various subsections, can be used to relate algorithm kernels to actual applications.

Application profiling at the algorithm kernel level looks at the number of times key algorithm kernels are executed when an application is run. This can be done in a number of ways. Code in high-level languages such as C, for example, is an excellent source of profiling information because most algorithm kernels can be identified as subroutines. If assembly code is available, profiling information may be extracted by running the code on an instruction set simulator equipped with profiling capabilities, or by setting breakpoints in key sections of code to see how often they are executed. Profiling information can also be estimated by studying application specifications or block-level signal flow diagrams.

Application profiling allows developers to estimate the relative importance of each algorithm kernel benchmark in a particular application. Of course, it is not a perfect process. If the suite is limited to a reasonable number of benchmarks, say ten or fifteen, then in many cases there won't be an exact match between every algorithm found in a complex application and a benchmark. Engineers will have to approximate some of the application's processing by using benchmarks that perform similar, but not identical, computations. It is also important to note that application profiling may not identify some of the optimizations that become possible when assembly code is written. For example, a programmer may notice that a set of intermediate values used in one algorithm kernel is also used in a later algorithm kernel. By reusing the values, the programmer may be able to significantly reduce the amount of processing required in the second kernel.

A processor's performance on an application is estimated by combining the results of the benchmarks with the results of the application profiling. Multiplying the benchmark execution times by the number of occurrences of each benchmark (or a similar algorithm kernel) yields an estimate of the time required to execute the application. Comparing the application execution time estimates of different processors allows an engineer to gauge the relative suitability of each processor for the application.

Application profiling can be illustrated with the example of a hypothetical 10-band audio graphic equalizer. A stream of digital audio samples enters the graphic equalizer at a fixed sampling rate. The equalizer's output is produced by filtering the input samples with 10 separate cascaded biquad IIR filters. We will estimate the time required per sample as 10 times the time needed to perform BDTI's eight-biquad IIR filter benchmark. Figure 2 illustrates the estimated execution times of four 16-bit fixed-point processors from Analog Devices, Motorola, Lucent Technologies, and Texas Instruments. Since the maximum allowable execution time is the reciprocal of the system's sampling rate, the execution time estimate indicates whether or not a processor has enough performance to implement the equalizer. The results suggest that these processors could all easily handle stereo operation at sampling rates above 48 kHz. However, processing requirements beyond the filtering itself, such as control code, must also be considered. BDTI's Control benchmark can be used to compare processors' relative efficiency at executing control code.

[Figure 2: bar chart of estimated equalizer execution times for the ADSP-2189M, DSP56652, DSP1620, and TMS320C549 (70-120 MIPS devices).]
FIGURE 2. Execution times for graphic equalizer, in microseconds (lower is better). Note: Execution times are calculated for the fastest version of each processor available in December 1999.
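The profiling estimate just described is a back-of-the-envelope calculation, and can be sketched directly. The benchmark time used below is a hypothetical placeholder, not a published BDTI result, and the function names are ours:

```python
def equalizer_time_per_sample_us(iir_benchmark_us, bands=10):
    """Estimated per-sample processing time for the hypothetical
    graphic equalizer: bands x the IIR benchmark time."""
    return bands * iir_benchmark_us

def supports_stereo_48k(iir_benchmark_us):
    """True if the estimate fits the per-sample time budget for two
    channels at a 48 kHz sampling rate (1e6 / 48000 us per sample)."""
    budget_us = 1e6 / 48000.0   # ~20.8 us between samples
    needed_us = 2 * equalizer_time_per_sample_us(iir_benchmark_us)
    return needed_us <= budget_us
```

With an assumed benchmark time of 0.5 microseconds, the equalizer needs 5 microseconds per sample per channel, comfortably inside the 48 kHz stereo budget; the same comparison against the reciprocal of any target sampling rate reproduces the feasibility check described above.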

Other Considerations

Although performance is a leading consideration, many other factors affect the choice of a DSP processor. Application development tools, for instance, cannot be overlooked. Without effective development tools, writing application software can be difficult no matter how strong the processor's performance. Likewise, chip vendor and third-party application engineering support can be invaluable when problems arise. Additionally, designers cannot overlook physical size considerations and must choose a processor that is available in an appropriate package.

Cost is another critical concern. There are two ways to view the ratio of cost to performance. In some instances, additional performance beyond the required minimum will remain unused; in this situation, designers typically seek the lowest-cost processor with adequate performance. At other times, the excess performance may allow additional features to be added to the product. Or, the designer may want a line of code-compatible DSP processors with performance levels appropriate for different members of an entire product line. In this situation, a cost-execution time product metric (the execution time of a processor multiplied by the unit cost) may be useful. Figure 3 shows the cost-execution time product of several processors on BDTI's FFT benchmark.

[Figure 3: bar chart of cost-execution time products for fixed-point processors (DSP56311, MSC8101, TMS320C5416, TMS320C6203) and the floating-point TMS320C6701.]
FIGURE 3. Cost-execution time product for a 256-point complex FFT, in microsecond-dollars (lower is better). Note: Results are calculated for the fastest version of each processor projected to be available in June 2000. Costs are projected quantity 10,000 pricing for June 2000.

Designers must also remember that minimizing system cost may not always mean minimizing DSP processor cost. For example, one processor may use memory more efficiently than a slightly less expensive processor. If the lower memory usage can eliminate one memory chip from the system, the more expensive processor may minimize system cost. Designers must also weigh the cost of engineering time and carefully consider how the quality of application development tools will affect product development schedules.

Lessons Learned

There is no easy way to evaluate DSP processor performance meaningfully. Traditional performance units like MIPS and MOPS are misleading and do not reflect many relevant factors like application execution time, memory usage, and energy consumption. Application benchmarks, too, suffer from limitations that make fair comparisons difficult. Fortunately, a methodology of algorithm kernel benchmarking and application profiling provides good estimates of processor performance weighted to the target application.

We expect that DSP systems will continue to become more sophisticated and demand greater computational performance. At the same time, semiconductor vendors will continue to develop more powerful DSP processors and integrate these processors with other system components such as microcontrollers and peripherals. As systems become more complicated and processor choices grow, designers will need good estimates of a processor's DSP performance. The methodology outlined above is an excellent starting place for calculating these estimates.

References

[1] Buyer's Guide to DSP Processors, Berkeley, California: Berkeley Design Technology, Inc., 1994, 1995, 1997, 1999, 2001. This 846-page technical report discusses DSP benchmarking methodology in detail and contains extensive benchmarking data for popular DSP processors. The report provides execution-time application profiling data from several common DSP applications. Excerpts from this report, as well as a pocket guide to DSP processors, are available at www.BDTI.com.

[2] Inside the StarCore SC140, Berkeley, California: Berkeley Design Technology, Inc., 2000. This report provides a comprehensive qualitative analysis of the StarCore SC140's architecture and features, along with a quantitative analysis based on results from BDTI's DSP benchmark suite. The SC140's performance is compared to that of key competitors, with benchmark results analyzed in terms of underlying architectural strengths and weaknesses. The report includes coverage of Motorola's SC140-based MSC8101.

[3] Phil Lapsley, Jeff Bier, Amit Shoham, and Edward A. Lee, DSP Processor Fundamentals: Architectures and Features, Berkeley, California: Berkeley Design Technology, Inc., 1996. An introductory textbook on DSP processor architectures which discusses how processor architecture affects performance.

About Berkeley Design Technology

Berkeley Design Technology, Inc. (BDTI) is a software and technical services company focused on digital signal processing (DSP) technology. The company was founded by U.C. Berkeley faculty and researchers.

BDTI specializes in the analysis, benchmarking, evaluation, and development of technology used to implement DSP applications. Specifically, the company:

• Performs in-depth technical evaluations of microprocessors.
• Develops DSP application software and firmware.
• Publishes technical reports and books on DSP technology, including Buyer's Guide to DSP Processors, Inside the StarCore SC140, and DSP Processor Fundamentals.
• Analyzes DSP algorithms and applications.
• Evaluates design tools and advises on tool selection and design methodologies.
• Develops specifications and recommendations for new DSP processors, software, and tools.
• Provides DSP-related training classes.

BDTI customers include: 3Com, AMD, Analog Devices, ARM, Cadence, Cisco, Compaq, Conexant, CSF Thomson, Dow Chemical, DSP Group, E.M. Warburg Pincus, Ericsson, Fujitsu, Hewlett-Packard, Hitachi, IBM, IDT, Infineon Technologies, Intel, LSI Logic, Lucent Technologies, Mentor Graphics, Microsoft, MIPS, Motorola, National Semiconductor, NEC, Nokia, Philips, Principal Financial, Raytheon, RealNetworks, Replay Networks, Sony, StarCore, STMicroelectronics, Sun Microsystems, Synopsys, Texas Instruments, VLSI Technology, Wind River Systems, Xilinx, and Zoran.

BERKELEY DESIGN TECHNOLOGY, INC.
2107 Dwight Way, Second Floor
Berkeley, CA 94704 U.S.A.
+1 (510) 665-1600
Fax: +1 (510) 665-1680
Email: info@BDTI.com
Web: www.BDTI.com

International Representatives

JAPAN
Shinichi Hosoya
Japan Kyastem Co.
Tokyo, Japan
+81 (425) 23 7176
Fax: +81 (425) 23 7178
bdt-info@kyastem.co.jp

