
Tumor Volume Measurement and Volume Measurement Comparison Plug-ins for VolView Using ITK

Teo Popa^a, Luis Ibanez^b, Elliot Levy^a, Amy White^a, Jill Bruno^a, Kevin Cleary^a

^a Imaging Science and Information Systems (ISIS) Center, Department of Radiology, Georgetown University, 2115 Wisconsin Avenue, Suite 603, Washington, DC, USA
^b Kitware Inc., Clifton Park, NY, USA
ABSTRACT

Volume measurement plays an important role in many medical applications in which physicians need to quantify tumor growth over time. For example, tumor volume estimation can help physicians diagnose patients and evaluate the effects of therapy. These measurements can also help researchers compare segmentation methods. To quickly check the results of volume data processing, researchers need a graphical interface with volume visualization features. VolView is an interactive visualization environment that provides such an interface, and its plug-in architecture allows it to be used as a platform for evaluating advanced image processing algorithms. In this work, we implemented VolView plug-ins for two volume measurement algorithms and three volume comparison algorithms. One volume measurement algorithm uses voxel counting, and the other provides a finer volume measurement by anti-aliasing the tumor volume. The three volume comparison methods are a maximum surface distance measure, the mean absolute surface distance, and a volumetric overlap measure. The implementation relies heavily on software components from the open-source Insight Segmentation and Registration Toolkit (ITK). The paper also presents the use of the VolView environment to evaluate liver tumor segmentation based on level set techniques. The simultaneous truth and performance level estimation (STAPLE) method was used to estimate the ground truth from the segmentations of multiple radiologists.

Keywords: volume measurement, volume comparison, VolView, ITK, open source, segmentation

1. INTRODUCTION

Segmentation is one of the major problems of medical image processing. Because of the complexity and variability of anatomical structures, segmentation algorithms generally must be tuned for each particular case. The use of different segmentation techniques also requires methods to evaluate and compare them, which can be achieved through a quantitative comparison among segmentation algorithms. Such comparisons require 1) a ground truth computed from the manual segmentations of several radiologists and 2) measures for comparing different segmentations. Different measures provide complementary information on the accuracy and precision of a segmentation result. For example, the Hausdorff distance reports the single largest misalignment between the two surfaces, whereas volumetric overlap summarizes the agreement between the compared regions as a whole. In this paper we implement five measurement and comparison methods (two for volume measurement and three for volume comparison). We also present a quantitative and qualitative comparison between manual segmentation and a semi-automatic technique based on threshold level sets, using the STAPLE method [1] to estimate the ground truth. Finally, we introduce a rapid development environment based on the VolView plug-in architecture and ITK components.

cleary@georgetown.edu; phone 202-687-8253; fax 202-784-3479; www.isis.georgetown.edu

2. MATERIALS AND METHODS

2.1. Volume Measurement Algorithms

To compute tumor volume, many radiologists use an ellipsoid volume formula (length × depth × width × 0.5233, where 0.5233 approximates π/6) [2]. This method gives reasonable volume estimates for tumors with relatively spherical or ellipsoidal shapes, which occur in specific types of both benign and malignant tumors. However, it is less accurate for the majority of tumors, which have irregular borders and may have necrotic centers. Additionally, the method retains only three linear dimensions for later use. In medical image processing, tumor volume is traditionally measured by manual segmentation on multiple image slices, with or without computer assistance such as livewire techniques [3]. The volume is then computed by combining the data obtained from each slice. We have implemented two methods for volume measurement: one based on voxel counting, and another that provides a finer volume measurement by anti-aliasing the tumor volume. The anti-aliasing step reduces the aliasing artifacts that arise when binary partitioned surfaces are visualized.

2.2. Volume Comparison Algorithms

For volume comparison, we have implemented three methods that we consider important for a basic segmentation evaluation: a maximum surface distance measure, the mean absolute surface distance, and a volumetric overlap measure. These methods do not report tumor changes in volumetric units; rather, they indicate whether the tumor volume has changed.

1) For the maximum surface distance measure, which measures the largest distance between the surfaces of two tumor volumes, we used the Hausdorff-Chebyshev metric [4].

2) The mean absolute surface distance computes the average distance between two volume data sets. There are cases where the Hausdorff distances for two volumes are approximately the same, but the average distances are very different. The mean surface distance measures the average distance between points on the surfaces of the volumes: the filter extracts the contour of the second volume and computes the average distance from the first tumor to this surface. To compute the distance map, we used a Danielsson distance map algorithm, which computes a signed distance map with pixel accuracy that approximates the Euclidean distance. In this signed map, voxels inside the segmented object contour have negative distances.

3) We implemented the volumetric overlap metric [5] by comparing the sets of non-zero voxels from two binary segmentations for relative overlap. This measure is derived from a reliability measure known as the kappa statistic and has been described in the literature for comparing two segmentation masks. It is sensitive to differences in both size and location (it depends on both the size and the contour of the object), and it gives a score of 1 for perfect agreement and 0 in the absence of overlap.

These three comparison measures were applied to binary images; therefore, they do not take into account the partial volume corrections implemented in the anti-aliased volume measurement method. The measurement methods give information only about the difference in volume between the analyzed segmentations, whereas the comparison methods also give information about their position relative to each other.

2.3. Graphical User Interface

VolView is an intuitive, interactive volume visualization system from Kitware Inc. that allows researchers and clinicians to quickly explore complex 3D medical or scientific images. Because VolView already provides advanced visualization and user interaction functionality, developers of new algorithms are free to focus their efforts on image processing and analysis capabilities.
An example of the graphical user interface (GUI) of VolView is shown in Figure 1.

Figure 1: Hausdorff distance plug-in demonstration using VolView. The figure shows that the distance from the top right edge of the small cube to the top right edge of the large cube is 17.31 units.

The plug-in architecture of VolView defines a very simple mechanism that specifies the number and type of parameters required from the user. VolView then creates GUI elements for those parameters and passes the corresponding values to the plug-in. This architecture is shown in Figure 2.

Figure 2: Plug-in architecture of VolView. [Diagram components: VolView Application; vtkVVPlugin GUI Description; vtkVVProcess Input Data; Filter A, Filter B, Filter C; vtkVVProcess Export Data.]

To develop plug-ins, we implemented three VolView functions (not shown in Figure 2): 1) The Init() function defines the fundamental characteristics of the plug-in. These include its name, group, terse documentation, the number of GUI items needed for passing parameters from the user, and an estimate of the number of bytes per voxel required for processing the image. 2) The UpdateGUI() function defines all the properties of the GUI items: their text labels, type (scale, number, check box, etc.), default value, range of values, and a short help message describing the role of the parameter. 3) The ProcessData() function is where the ITK pipeline of the plug-in is actually executed. It usually involves creating the ITK pipeline, gathering all the parameters from the GUI, passing them to the pipeline, and triggering its execution.
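The division of labour between the three functions can be illustrated with a small Python mock. This is a sketch under loud assumptions: the class `VoxelCountPlugin`, its methods, the dictionary keys and the `run_plugin` host are all hypothetical, they do not reproduce the actual VolView plug-in API, and a plain voxel count stands in for the ITK pipeline that ProcessData() would execute.

```python
# Hypothetical mock of the Init / UpdateGUI / ProcessData split described above.
# All names and data structures are illustrative; this is NOT the VolView API.


class VoxelCountPlugin:
    def init(self):
        # Fundamental characteristics of the plug-in (cf. Init()).
        return {
            "name": "Voxel Counting Volume",
            "group": "Measurement",
            "terse_doc": "Counts labeled voxels and reports the volume in mm^3",
            "number_of_gui_items": 1,
            "bytes_per_voxel": 1,
        }

    def update_gui(self):
        # Properties of each GUI item (cf. UpdateGUI()).
        return [{
            "label": "Label value",
            "type": "scale",
            "default": 1,
            "range": (0, 255),
            "help": "Voxel value that marks the tumor",
        }]

    def process_data(self, voxels, params, voxel_volume_mm3):
        # Where the ITK pipeline would run (cf. ProcessData()); here a plain count.
        label = params["Label value"]
        return voxel_volume_mm3 * sum(1 for v in voxels if v == label)


def run_plugin(plugin, voxels, user_values, voxel_volume_mm3=1.0):
    """Hypothetical host: builds parameters from the GUI description, then runs the plug-in."""
    info = plugin.init()
    items = plugin.update_gui()
    if len(items) != info["number_of_gui_items"]:
        raise ValueError("GUI description does not match the declared item count")
    params = {}
    for item in items:
        value = user_values.get(item["label"], item["default"])
        low, high = item["range"]
        params[item["label"]] = min(max(value, low), high)  # clamp to the declared range
    return plugin.process_data(voxels, params, voxel_volume_mm3)


# Usage: the default label (1) occurs three times in this toy "image".
volume = run_plugin(VoxelCountPlugin(), [0, 1, 1, 2, 1], {})
```

The point of the design is that the host, not the plug-in, owns the user interface: the plug-in only declares what it needs, which is what lets algorithm developers ignore GUI code.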

2.4. Use case employing volume measurement and comparison algorithms

To evaluate the accuracy of a semi-automated tumor segmentation method, we compared the semi-automated result with the ground truth estimated from manual segmentations performed by experienced radiologists and a medical student. This comparison used the volume measurement and volume comparison algorithms defined above. We considered the semi-automated result accurate if its measurement and comparison values were similar to those of one radiologist. Thus, we evaluated whether the comparison metric values (Hausdorff distance, mean distance, and volume overlap) obtained for the semi-automatic segmentation versus the STAPLE ground truth are similar to the values obtained for the radiologists' segmentations versus the STAPLE ground truth. We also examined whether the volumes obtained with voxel counting or anti-aliased voxel counting differ by less than 20% from the volume of the estimated ground truth. Finally, we looked at the performance of a radiologist with high sensitivity and specificity versus the ground truth. Results with high sensitivity give high confidence in negatively classified voxels, whereas results with high specificity give high confidence in positively classified voxels.
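A minimal sketch of the two checks used in this use case, assuming binary segmentations stored as flat lists of 0/1 voxel labels and reading the 20% criterion as a relative difference from the reference volume. The function names are hypothetical, and the sensitivity/specificity function assumes the reference contains both foreground and background voxels (otherwise a denominator is zero).

```python
def sensitivity_specificity(segmentation, truth):
    """Voxel-wise sensitivity (true positive fraction) and specificity
    (true negative fraction) of a binary segmentation against a binary reference."""
    tp = fn = tn = fp = 0
    for s, t in zip(segmentation, truth):
        if t:
            if s:
                tp += 1
            else:
                fn += 1
        elif s:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def within_volume_tolerance(volume, reference_volume, tolerance=0.20):
    """True if `volume` differs from `reference_volume` by at most `tolerance` (relative)."""
    return abs(volume - reference_volume) <= tolerance * reference_volume


# Usage with two voxel-counting volumes reported in Table 4c (mm^3):
rad1_patient1_ok = within_volume_tolerance(6708, 7036)    # radiologist 1 vs STAPLE
semi_patient2_ok = within_volume_tolerance(4186, 2286)    # semi-automatic vs STAPLE
```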

2.5. STAPLE Algorithm

The ground truth segmentation was estimated from a set of manual segmentations performed by several experienced radiologists and medical students, and the result of the semi-automatic segmentation was evaluated by comparison with this estimated ground truth. To estimate the ground truth from multiple raters, we applied the expectation-maximization method described by Warfield et al. [1], and obtained a binary ground truth by thresholding the resulting STAPLE probability map at a constant value (95% probability). This method uses binary labeled images and also estimates the performance parameters of each rater in terms of sensitivity and specificity. As mentioned above, we also used the final sensitivity and specificity values to identify the most accurate and the least accurate manual segmentation among the radiologists.

To describe the method, assume that an image containing $N$ voxels is segmented by $R$ raters. Let $D$ be an $N \times R$ matrix describing the binary decisions made by each rater at each voxel of the image. Let $T$ be a vector of $N$ elements representing the unknown binary true segmentation, where each voxel has a value of 1 if it is part of the structure of interest and 0 otherwise. We define $p = (p_1, p_2, \ldots, p_R)$ as the vector of rater sensitivities (true positive fractions) and $q = (q_1, q_2, \ldots, q_R)$ as the vector of rater specificities (true negative fractions). These parameters characterize the performance of each of the $R$ segmentations. $(D, T)$ denotes the complete data, whose likelihood function (probability distribution) is $P(D, T \mid p, q)$.

STAPLE estimates the performance parameters $p$ (sensitivity) and $q$ (specificity) from the classified voxels, modeling $p$ and $q$ independently for each rater. From these definitions, an expectation-maximization algorithm that estimates $p$ and $q$ from the rater decisions can be derived as described in Warfield et al. [1]. The goal is to find the values of $p$ and $q$ that maximize the complete-data log-likelihood:

$$(p', q') = \arg\max_{p,\,q} \ln P(D, T \mid p, q)$$

where the probability $P(D, T \mid p, q)$ depends on the rater decisions and the performance parameters.

The expectation-maximization algorithm has four steps: 1) Initialization: the hypothesis is initialized with the average segmentation of all radiologists. 2) Estimation: the likelihood $P(D, T \mid p, q)$ is estimated from the current hypothesis $(p, q)$ and the observed data $D$. Because the true segmentation $T$ is unknown, the complete data $(D, T)$ is treated as a random variable and the expected value $E[\ln P(D, T \mid p, q)]$ is used. 3) Maximization: the hypothesis $(p, q)$ is replaced by the new hypothesis $(p', q')$ that maximizes this estimate of the likelihood. 4) Convergence: convergence is determined from the rate of change of the sum of the true segmentation probabilities. These four steps are illustrated in Figure 3.
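For binary labels, the estimation and maximization steps have closed forms, following Warfield et al. [1]. The notation here is introduced for illustration: $D_{ij}$ is the decision of rater $j$ at voxel $i$, $W_i$ is the current estimate of $P(T_i = 1 \mid D, p, q)$, and $\gamma$ is the prior probability that a voxel belongs to the structure of interest.

$$W_i = \frac{\gamma \prod_{j:\,D_{ij}=1} p_j \prod_{j:\,D_{ij}=0} (1 - p_j)}{\gamma \prod_{j:\,D_{ij}=1} p_j \prod_{j:\,D_{ij}=0} (1 - p_j) + (1 - \gamma) \prod_{j:\,D_{ij}=0} q_j \prod_{j:\,D_{ij}=1} (1 - q_j)}$$

$$p_j' = \frac{\sum_{i:\,D_{ij}=1} W_i}{\sum_{i} W_i}, \qquad q_j' = \frac{\sum_{i:\,D_{ij}=0} (1 - W_i)}{\sum_{i} (1 - W_i)}$$

The first expression is the estimation step and the second the maximization step; the convergence test of step 4 monitors the change in $\sum_i W_i$ between successive iterations.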

Figure 3: STAPLE General Expectation Maximization Framework (Initialization, Estimation Step, Maximization Step, Convergence)
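The four steps can be sketched in plain Python for binary labels. This is a minimal illustration, not the ITK implementation used in this work. Assumptions: decisions are stored per rater (`decisions[j][i]` is rater j's label at voxel i), a single global prior equal to the overall fraction of foreground labels is used, and because the initialization fills in the voxel probabilities W (the raters' average) rather than (p, q), each pass performs the maximization update before the estimation update. The function name `staple` and its parameters are hypothetical.

```python
from math import prod


def staple(decisions, tol=1e-6, max_iter=100):
    """Binary STAPLE sketch.

    decisions[j][i] is rater j's label (0 or 1) for voxel i.
    Returns (w, p, q): w[i] estimates P(T_i = 1 | D); p[j] and q[j] are the
    sensitivity and specificity of rater j.
    """
    n_raters, n_voxels = len(decisions), len(decisions[0])
    # Assumption: one global prior, the fraction of all labels that are 1.
    prior = sum(map(sum, decisions)) / (n_raters * n_voxels)
    # Step 1, initialization: average decision of all raters at each voxel.
    w = [sum(d[i] for d in decisions) / n_raters for i in range(n_voxels)]
    p, q = [1.0] * n_raters, [1.0] * n_raters
    for _ in range(max_iter):
        # Maximization: performance parameters that best explain the current w.
        sum_w = sum(w)
        sum_not_w = n_voxels - sum_w
        for j, d in enumerate(decisions):
            if sum_w > 0:
                p[j] = sum(wi for wi, dij in zip(w, d) if dij) / sum_w
            if sum_not_w > 0:
                q[j] = sum(1 - wi for wi, dij in zip(w, d) if not dij) / sum_not_w
        # Estimation: posterior probability that each voxel is truly foreground.
        new_w = []
        for i in range(n_voxels):
            a = prior * prod(p[j] if decisions[j][i] else 1 - p[j] for j in range(n_raters))
            b = (1 - prior) * prod(1 - q[j] if decisions[j][i] else q[j] for j in range(n_raters))
            new_w.append(a / (a + b) if a + b > 0 else prior)
        # Convergence: change in the sum of the true-segmentation probabilities.
        converged = abs(sum(new_w) - sum(w)) < tol
        w = new_w
        if converged:
            break
    return w, p, q


# Usage: two raters agree; the third misses voxel 3 and adds voxel 4.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
raters = [truth, truth, [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]]
w, p, q = staple(raters)
ground_truth = [1 if wi >= 0.95 else 0 for wi in w]  # 95% threshold, as in Section 2.5
```

In this toy example the thresholded estimate recovers the labels of the two agreeing raters, and the third rater receives the lowest sensitivity, which is how the per-rater parameters can be used to rank manual segmentations.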

3. RESULTS

3.1. Plug-ins test example

To verify that our implementation of the volume measurement and volume comparison algorithms was correct, we validated our results against two other software packages popular in the medical image processing community. 1) We validated the results of the volume measurement plug-ins with Analyze [6], a biomedical image processing application from the Mayo Clinic. 2) We validated the results of the volume comparison plug-ins with Valmet [7], a public domain tool from UNC-Chapel Hill.

To examine the volume measurement indices, we tested the voxel counting and anti-aliased voxel counting measures on a synthetic data set consisting of a sphere with a radius of 20 voxels, each voxel being 1 × 1 × 1 mm. The results are shown in Table 1. Both techniques provide accurate results, but the anti-aliased method returned a more accurate volume: the voxel counting measure is within 0.5% of the true value, and the anti-aliased measure is within 0.1% of it.

Method | Volume (mm^3) | Difference
Analytical volume, (4/3)·π·r^3 | 33510 | -
Voxel counting | 33371 | 0.42%
Anti-aliased voxel counting | 33484 | 0.078%

Table 1: Synthetic data test for volume measurement indices
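The voxel counting test can be reproduced in spirit with a few lines of plain Python, without ITK. Assumptions: the sphere is centred on a voxel centre and a voxel is counted when its centre lies inside the sphere; the paper does not state how its synthetic sphere was rasterized, so the exact count may differ somewhat from Table 1. The ellipsoid formula from Section 2.1 is included for comparison, using π/6 as the factor. The anti-aliased measure, which depends on ITK filtering, is not reproduced. All function names are hypothetical.

```python
from math import pi


def sphere_mask(radius):
    """Binary sphere on a (2r+1)^3 grid, centred on the middle voxel.

    A voxel is set to 1 when its centre lies inside the sphere (an assumption;
    other rasterization rules give slightly different counts).
    """
    r2 = radius * radius
    axis = range(-radius, radius + 1)
    return [[[1 if x * x + y * y + z * z <= r2 else 0 for z in axis] for y in axis] for x in axis]


def voxel_count_volume(mask, voxel_volume_mm3=1.0):
    """Voxel counting: number of foreground voxels times the volume of one voxel."""
    return voxel_volume_mm3 * sum(v for plane in mask for row in plane for v in row)


def ellipsoid_volume(length, width, depth, factor=pi / 6):
    """Ellipsoid estimate from Section 2.1; the paper quotes a factor of 0.5233."""
    return length * width * depth * factor


analytic = 4 / 3 * pi * 20 ** 3  # about 33510 mm^3, as in Table 1
counted = voxel_count_volume(sphere_mask(20))
relative_error = abs(counted - analytic) / analytic
```

For anisotropic data such as the liver CT sets, `voxel_volume_mm3` would be the product of the in-plane pixel spacing and the 1.73 mm slice thickness.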

Next, we acquired four anonymized liver tumor CT data sets from Georgetown University Hospital, with a slice thickness of 1.73 mm. Each data set was manually segmented by three experienced radiologists and by a medical student, and was also segmented semi-automatically using threshold level sets. Figure 4 shows a VolView screenshot of a segmented tumor using surface rendering, with the volume computed by voxel counting shown on the left-hand side.

Figure 4: VolView plug-in for computing the liver tumor volume and representative axial slice.

3.2. Segmentation Comparison

Table 2 compares the semi-automatic segmentation with the first radiologist's segmentation using the comparison metrics described in this paper, and Table 3 lists the volume measurements of the first and second radiologists and of the semi-automatic segmentation. These tables quantify only the difference between the results; they do not show which result is closer to the ground truth. Table 4a compares the estimated ground truth computed with STAPLE with the first radiologist's segmentation. These results are intended only to illustrate the method and the distance measures. Because the first radiologist's results were used to compute the STAPLE ground truth, a better evaluation would involve more radiologists and use a leave-one-out approach for this comparison. Table 4b compares the estimated STAPLE ground truth with the semi-automatic segmentation using threshold level sets, and Table 4c shows the voxel counting results for the first radiologist, the estimated ground truth, and the semi-automatic technique.

While this was a very limited study, some tentative conclusions can be drawn. Tables 4a and 4b suggest that the radiologist agrees more closely with the STAPLE result than the semi-automatic segmentation does. Table 4c suggests that both the radiologist and the semi-automatic segmentation tend to oversegment (produce larger volumes) relative to STAPLE: for each of them, three of the four volumes are larger. The volumes from the semi-automatic segmentation were higher than those from manual segmentation in three cases and lower for patient 4. We were unable to find any correlation between the different comparison measures, or between the comparison measures and the volume measures. We also computed the sensitivity and specificity of the semi-automatic segmentation with respect to STAPLE and compared them with the sensitivity and specificity of the raters; the values for the semi-automatic segmentation fell outside the range of the raters.
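The three comparison measures reported in Tables 2, 4a and 4b can be sketched in plain Python for small masks. This is an illustration under stated assumptions, not the plug-in code: masks are sets of (x, y, z) voxel indices, a brute-force nearest-neighbour search replaces the Danielsson distance map (so cost grows with the product of the two surface sizes), the volumetric overlap is assumed to be the Dice coefficient (the spatial overlap index of Zou et al. [5]), and the mean distance is taken in one direction, from the surface of the first mask to that of the second, following Section 2.2. Function names are hypothetical.

```python
from itertools import product
from math import dist

# 6-connected neighbourhood used to decide whether a voxel lies on the surface.
NEIGHBORS_6 = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def surface(mask):
    """Voxels of `mask` with at least one 6-neighbour outside the mask."""
    return {
        (x, y, z)
        for (x, y, z) in mask
        if any((x + dx, y + dy, z + dz) not in mask for dx, dy, dz in NEIGHBORS_6)
    }


def _to_mm(voxel, spacing):
    return tuple(c * s for c, s in zip(voxel, spacing))


def _directed_distances(src, dst, spacing):
    """For each voxel in src, the Euclidean distance (mm) to the nearest voxel in dst."""
    dst_mm = [_to_mm(b, spacing) for b in dst]
    return [min(dist(_to_mm(a, spacing), b) for b in dst_mm) for a in src]


def hausdorff_distance(mask_a, mask_b, spacing=(1.0, 1.0, 1.0)):
    """Maximum surface distance: the larger of the two directed Hausdorff distances."""
    sa, sb = surface(mask_a), surface(mask_b)
    return max(max(_directed_distances(sa, sb, spacing)),
               max(_directed_distances(sb, sa, spacing)))


def mean_surface_distance(mask_a, mask_b, spacing=(1.0, 1.0, 1.0)):
    """Average distance from the surface of mask_a to the surface of mask_b."""
    d = _directed_distances(surface(mask_a), surface(mask_b), spacing)
    return sum(d) / len(d)


def volumetric_overlap(mask_a, mask_b):
    """Dice coefficient: 1 for identical masks, 0 when the masks do not overlap."""
    return 2 * len(mask_a & mask_b) / (len(mask_a) + len(mask_b))


# Usage: two 4 x 4 x 4 cubes, the second shifted by one voxel along x.
cube_a = set(product(range(4), repeat=3))
cube_b = {(x + 1, y, z) for (x, y, z) in cube_a}
```

For these cubes the overlap is 0.75 (48 shared voxels out of 64 in each), the Hausdorff distance is 1.0, and the mean surface distance is 20/56 ≈ 0.357, because only the 20 surface voxels on the two x-faces of the first cube are not already on the second cube's surface. This also shows why the measures are complementary: the shift is small everywhere, so the distances stay low while the overlap drops noticeably for such a small object.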

Comparison metric | Patient 1 | Patient 2 | Patient 3 | Patient 4
Mean distance (mm) | 0.328 | 0.363 | 0.354 | 0.560
Hausdorff distance (mm) | 3 | 2.236 | 2.23 | 5.196
Volumetric overlap (0-1) | 0.837 | 0.810 | 0.802 | 0.870

Table 2: Comparison between radiologist 1 and semi-automatic segmentation

All volumes in mm^3.
Patient | Voxel counting Rad01 | Voxel counting Rad02 | Voxel counting SemiAuto | Anti-aliased Rad01 | Anti-aliased Rad02 | Anti-aliased SemiAuto
Patient 1 | 6708 | 7162 | 7303 | 7251 | 7631 | 7342
Patient 2 | 2604 | 2243 | 8493 | 2863 | 2493 | 8398
Patient 3 | 5744 | 5763 | 6522 | 6101 | 6032 | 6555
Patient 4 | 84996 | 84648 | 46314 | 86212 | 85896 | 46440

Table 3: Measurement results of the first and second radiologists and the semi-automatic segmentation

STAPLE vs. radiologist 1 | Patient 1 | Patient 2 | Patient 3 | Patient 4
Hausdorff distance (mm) | 3.00 | 2.23 | 2.24 | 6.70
Mean distance (mm) | 1.08 | 1.04 | 1.148 | 1.05
Volume overlap (0-1) | 0.896 | 0.885 | 0.883 | 0.915

Table 4a: Comparison between the first radiologist and the estimated ground truth from STAPLE

STAPLE vs. semi-automatic segmentation | Hausdorff distance (mm) | Mean distance (mm) | Volume overlap (0-1)
Patient 1 | 5.65 | 1.078 | 0.686
Patient 2 | 3.00 | 1.68 | 0.701
Patient 3 | 4.89 | 1.19 | 0.723
Patient 4 | 5.09 | 2.78 | 0.730

Table 4b: Comparison between the results of the semi-automatic segmentation and the estimated ground truth from STAPLE

Voxel counting (mm^3) | Radiologist 1 | Semi-automatic | STAPLE
Patient 1 | 6708 (95%) | 7299 (103%) | 7036
Patient 2 | 2603 (113%) | 4186 (183%) | 2286
Patient 3 | 5744 (121%) | 6513 (137%) | 4726
Patient 4 | 84996 (114%) | 46312 (59%) | 77781

Table 4c: Voxel counting measure for radiologist 1, semi-automated segmentation, and estimated ground truth (volume as a percentage of the STAPLE volume shown in parentheses)

4. CONCLUSIONS

The use of the open-source ITK library in combination with VolView provides a complete environment for tumor visualization and user interaction. This approach enables algorithm developers to focus their efforts on developing image processing and analysis capabilities and to rely on VolView for other functionality. Semi-automated segmentation modules and their plug-ins were contributed earlier by other groups and are available for download from the ITK site. For the comparison measures, the semi-automated segmentation technique performed outside the range of the radiologists' segmentations. Therefore, we conclude that, with the default parameters, the semi-automatic segmentation method used here is less accurate than segmentation by radiologists, although further exploration of its parameter settings may improve the results. The tumor volume measurement and volume comparison methods presented in this paper are clinically significant and provide a potentially useful assessment of tumor growth or cure. These measurements allow both clinicians and researchers to gather precise and accurate data non-invasively in a variety of settings, including both benign and malignant tumors. Tumor regression can be monitored in response to chemotherapy or after radiofrequency ablation, and even small increases in tumor volume can be detected, potentially preventing unwanted growth or recurrence of malignant tissue.
These metrics can also be extended to other medical applications such as observing liver motion in 4D CT datasets in the field of image-guided radiation therapy.

ACKNOWLEDGEMENT
This work was funded through an A2D2 (Algorithms, Adapters, and Data Distribution) purchase order award from the National Library of Medicine at the National Institutes of Health.

REFERENCES
1. S. Warfield, K. Zou, and W. Wells, Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Transactions on Medical Imaging, 23(7):903-921, 2004.
2. S. Goodwin, S. Bonilla, D. Sacks, R. Reed, J. Spies, W. Landow, and R. Worthington-Kirsch, Reporting Standards for Uterine Artery Embolization for the Treatment of Uterine Leiomyomata. Journal of Vascular and Interventional Radiology, 14:467S-476S, 2003.
3. E. N. Mortensen and W. A. Barrett, Interactive segmentation with intelligent scissors. Graphical Models and Image Processing, 60(5):349-384, 1998.
4. G. A. Edgar, Measure, Topology, and Fractal Geometry, 2nd Edition. Springer, 1990.
5. K. Zou, S. Warfield, A. Bharatha, C. Tempany, M. Kaus, S. Haker, W. Wells, F. Jolesz, and R. Kikinis, Statistical Validation of Image Segmentation Quality Based on a Spatial Overlap Index. Academic Radiology, 11(2):178-189, 2004.
6. R. Robb, The Biomedical Imaging Resource at Mayo Clinic. IEEE Transactions on Medical Imaging, 20(9):854-861, 2001.
7. G. Gerig, M. Jomier, and M. Chakos, Valmet: A New Validation Tool for Assessing and Improving 3D Object Segmentation. Medical Image Computing and Computer-Assisted Intervention (MICCAI), 516-523, 2001.
