
2008 International Conference on Computer Science and Software Engineering

Medical Image Retrieval Based on Latent Semantic Indexing
Qin Chen, Xiaoying Tai, Baochuan Jiang, Gang Li, Jieyu Zhao
Institute of Information Science & Engineering
Ningbo University
Ningbo, Zhejiang, China
kokoyy97@yahoo.com.cn taixiaoying@nbu.edu.cn
Abstract: To improve the performance of content-based medical image retrieval, this paper proposes an algorithm that applies latent semantic indexing (LSI) to gastroscopic image retrieval. We first extract the color histogram and color autocorrelogram of each image as low-level features, and then use normalization, term weighting and singular value decomposition to map these low-level features into high-level semantic features. In this way, the retrieval results accord better with the semantic content of the query image. Based on this idea, a prototype system that supports query by example image is designed and implemented. Experimental results on the prototype system show that the proposed approach is effective for gastroscopic image retrieval.

Keywords: color histogram; color autocorrelogram; latent semantic indexing; singular value decomposition

I. INTRODUCTION

With the development of modern information technology, a great number of medical images are generated every day. How to use these images to support diagnosis is an important issue. Content-Based Medical Image Retrieval (CBMIR) is the application of CBIR technology to the medical field. To describe image content, CBMIR usually extracts characteristics of the focus (lesion), such as color, texture, shape and spatial relation [1], and forms a low-level feature vector that serves as the basis for indexing and matching. Because there is a gap between how these low-level features describe the focus in a medical image and how a doctor describes it in a diagnosis, using the low-level features directly as the retrieval basis often fails to give satisfactory results. It is therefore necessary to find a mapping between the low-level visual features of an image and its high-level semantic information, so that the retrieval results accord better with the doctor's diagnosis. Focusing on gastroscopic images, this paper first extracts the color histogram and color autocorrelogram as low-level features, and then uses latent semantic indexing to map the low-level features into high-level semantic features. Experimental results on the prototype system show that the proposed approach is effective for gastroscopic image retrieval.

II. LOW-LEVEL FEATURES

A. Color histogram
This paper adopts a one-dimensional color histogram based on the HSV color space. The RGB color space can be converted into the HSV color space [2] to obtain $h \in [0, 360]$, $s \in [0, 1]$, $v \in [0, 1]$.

Red is the main color of gastroscopic images [3], while yellow and green are the colors of gastric cancer cells. In the converted HSV space, the h component concentrates on [0, 100] and [300, 360], while the s and v components are distributed more homogeneously. Accordingly, we quantize the h component non-uniformly into 16 levels, and quantize the s and v components uniformly into 4 levels each.

The three quantized components are then combined into a single color index: $C = 16H + 4S + V$, where $C \in [0, 255]$ is an integer.

The color histogram vector can then be expressed as $H = \{h[0], h[1], \ldots, h[255]\}$, where $h[i]$ is the percentage of pixels in the image whose color index is $i$ after quantization and combination.
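The quantization of Section II.A can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the paper says that hue is quantized non-uniformly into 16 levels but does not give the bin edges, so `HUE_EDGES` below is hypothetical (finer bins where the paper says hue concentrates), and the function names `quantize_hsv` and `color_histogram` are not from the paper.

```python
import colorsys

import numpy as np

# Hypothetical non-uniform hue bin edges in degrees (17 edges give 16 levels).
# The paper does not list its edges; these are an assumption for illustration,
# with finer bins in [0, 100] and [300, 360].
HUE_EDGES = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100,
                      200, 300, 320, 340, 350, 360], dtype=float)


def quantize_hsv(h, s, v):
    """Map h in [0, 360], s and v in [0, 1] to a color index C in [0, 255]."""
    h = np.asarray(h, dtype=float)
    s = np.asarray(s, dtype=float)
    v = np.asarray(v, dtype=float)
    # np.digitize returns 1..16 inside the edges; clip so h = 360 stays in level 15.
    H = np.clip(np.digitize(h, HUE_EDGES) - 1, 0, 15)
    # Uniform 4-level quantization; clip so s = 1 or v = 1 maps to level 3.
    S = np.clip((s * 4).astype(int), 0, 3)
    V = np.clip((v * 4).astype(int), 0, 3)
    return 16 * H + 4 * S + V


def color_histogram(rgb):
    """Return the 256-bin histogram (fractions summing to 1) of an RGB image.

    rgb: array of shape (height, width, 3) with values in [0, 255].
    """
    pixels = np.asarray(rgb, dtype=float).reshape(-1, 3) / 255.0
    # colorsys returns h, s, v all in [0, 1]; hue is rescaled to degrees below.
    hsv = np.array([colorsys.rgb_to_hsv(r, g, b) for r, g, b in pixels])
    index = quantize_hsv(hsv[:, 0] * 360.0, hsv[:, 1], hsv[:, 2])
    counts = np.bincount(index, minlength=256)
    return counts / counts.sum()
```

The per-pixel `colorsys` loop keeps the sketch dependency-free; for full-size images a vectorized RGB to HSV conversion would be the more practical choice.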
The project is sponsored by the National Natural Science Foundation of China under grant No. 60472099 and by the Ningbo Natural Science Foundation under grants 2006A610017 and 2004A610004.

978-0-7695-3336-0/08 $25.00 © 2008 IEEE
DOI 10.1109/CSSE.2008.1457

B. Color correlogram [4]
A color correlogram expresses how the spatial correlation of colors changes with distance.

Let $I$ be an $n \times n$ image whose colors are quantized into $m$ colors $c_1, c_2, \ldots, c_m$. For a pixel $p = (x, y) \in I$, let $I(p)$ denote its color, and let $I_c = \{p \mid I(p) = c\}$. The distance between pixels $p_1 = (x_1, y_1)$ and $p_2 = (x_2, y_2)$ is defined as $|p_1 - p_2| = \max\{|x_1 - x_2|, |y_1 - y_2|\}$. The correlogram of $I$ is defined for $i, j \in [m]$, $k \in [d]$ as:

$$\gamma^{(k)}_{c_i, c_j}(I) = \Pr_{p_1 \in I_{c_i},\, p_2 \in I} \left[ p_2 \in I_{c_j} \;\middle|\; |p_1 - p_2| = k \right] \tag{1}$$

To compute the correlogram, it suffices to compute the following count:

$$\Gamma^{(k)}_{c_i, c_j}(I) = \left| \{ p_1 \in I_{c_i},\, p_2 \in I_{c_j} \mid |p_1 - p_2| = k \} \right| \tag{2}$$
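As an illustration of the count in (2) restricted to identical colors ($i = j$), the following sketch counts same-color pixel pairs at chessboard distance exactly $k$ and divides by $8k$ times the number of pixels of each color, which is the normalization used for the correlogram. The function name `autocorrelogram` and its parameters are illustrative and not from the paper. Pixels near the image border have fewer than $8k$ neighbours inside the image, and this sketch makes no correction for that.

```python
import numpy as np


def autocorrelogram(index, k=1, n_colors=256):
    """Autocorrelogram of a 2-D array of color indices at chessboard distance k.

    Returns alpha with alpha[c] = Gamma_{c,c}^{(k)} / (8 * k * h_c), where Gamma
    counts ordered same-color pixel pairs at distance exactly k and h_c is the
    number of pixels of color c. Colors absent from the image get 0.
    """
    index = np.asarray(index, dtype=np.intp)
    rows, cols = index.shape
    if not 1 <= k < min(rows, cols):
        raise ValueError("k must satisfy 1 <= k < min(image height, image width)")
    # Offsets (dy, dx) with max(|dy|, |dx|) == k form a square ring of 8k pixels.
    offsets = [(dy, dx)
               for dy in range(-k, k + 1)
               for dx in range(-k, k + 1)
               if max(abs(dy), abs(dx)) == k]
    pair_counts = np.zeros(n_colors)
    for dy, dx in offsets:
        # Overlapping windows: a[y, x] is pixel p1 and b[y, x] is p1 + (dy, dx).
        a = index[max(0, -dy):rows - max(0, dy), max(0, -dx):cols - max(0, dx)]
        b = index[max(0, dy):rows - max(0, -dy), max(0, dx):cols - max(0, -dx)]
        same = a[a == b]
        pair_counts += np.bincount(same, minlength=n_colors)
    pixel_counts = np.bincount(index.ravel(), minlength=n_colors)
    denom = 8.0 * k * pixel_counts
    return np.divide(pair_counts, denom, out=np.zeros(n_colors), where=denom > 0)
```

Applied to the 256-level color index image with `k=1`, this yields a 256-dimensional feature vector of the same length as the color histogram.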

The correlogram is then obtained as

$$\gamma^{(k)}_{c_i, c_j}(I) = \frac{\Gamma^{(k)}_{c_i, c_j}(I)}{8k \cdot h_{c_i}(I)}, \quad \text{where } h_{c_i}(I) = n^2 \cdot \Pr_{p \in I}\left[p \in I_{c_i}\right]$$

is the number of pixels of color $c_i$. The denominator is the total number of pixels at distance $k$ from any pixel of color $c_i$. To reduce the space complexity, the autocorrelogram is used instead; it captures the spatial correlation between identical colors only and is defined as $\alpha^{(k)}_c(I) = \gamma^{(k)}_{c, c}(I)$. In our experiment, we calculate the autocorrelogram at distance 1.

III. LATENT SEMANTIC INDEXING [5]

In the text retrieval field, LSI applies SVD to the word-text matrix and keeps the k largest singular values and their corresponding singular vectors to construct a new matrix that approximates the word-text matrix. Because the new matrix removes noise and reduces the original feature dimension, it gives better retrieval performance.

How to make better use of LSI in image retrieval is a problem worth studying in depth. When LSI is extended to image retrieval, the word-text matrix corresponds to a semantic-image matrix. This paper applies LSI to the color histogram and color autocorrelogram of gastroscopic images, and compares the retrieval performance before and after normalization, term weighting and singular value decomposition. The results show that the method is indeed effective.

A. Normalization [6]
The purpose of normalization is to give each component of the feature vector the same importance.

Suppose there are $M$ images in the image database and each image has a $K$-dimensional feature vector, so the feature vector of the $m$-th image can be written as $V_m = [V_{m1}, \ldots, V_{mk}, \ldots, V_{mK}]$. The feature vectors of all images form a matrix $v = [v_{mk}]$ ($m = 1, \ldots, M$; $k = 1, \ldots, K$).

Suppose each column vector $v_k$ is a Gaussian sequence. We first calculate its mean $\mu_k$ and standard deviation $\sigma_k$, and then map each value of the sequence into the range $[-1, +1]$ using

$$v'_{mk} = \frac{v_{mk} - \mu_k}{3\sigma_k}.$$

Generally, we then shift the value into the interval $[0, 1]$:

$$v''_{mk} = \frac{v'_{mk} + 1}{2}.$$

B. Term weighting
In the text retrieval field, researchers often use term weighting to give index terms different significance so as to improve the performance of the retrieval system. To apply this technique to image retrieval, suppose there are $M$ images $I_1, I_2, \ldots, I_M$, from which we extract $K$ feature items $w_1, w_2, \ldots, w_K$. In image $I_j$, the weight $I_{ij}$ of feature item $w_i$ is the product of a local weight $l_{ij}$ and a global weight $g_i$: $I_{ij} = l_{ij} \cdot g_i$. The local weight is the significance of feature item $w_i$ in image $I_j$; the global weight is the significance of feature item $w_i$ in the whole image database.

Many weighting methods already exist. Some researchers indicate that a logarithmic term frequency as the local weight combined with entropy as the global weight gives the best performance. We adopt this method here:

Local weight (logarithmic term frequency): $l_{ij} = \log(1 + f_{ij})$.

Global weight (entropy): $g_i = 1 + \frac{1}{\log M} \sum_{j=1}^{M} \frac{f_{ij}}{F_i} \log \frac{f_{ij}}{F_i}$.

Here $f_{ij}$ is the occurrence frequency of feature item $w_i$ in image $I_j$, and $F_i$ is the occurrence frequency of feature item $w_i$ in the whole image set.

C. Reducing the rank of the matrix using singular value decomposition
Reducing the rank of a matrix $A$ using SVD has a useful mathematical property: for a given $k$, the rank-$k$ approximation of $A$ obtained this way is the one that changes $A$ the least.

The SVD of an $m \times n$ matrix $A$ is defined as $A = U \Sigma V^T$, where $U$ is an $m \times m$ orthogonal matrix and $V$ is an $n \times n$ orthogonal matrix. The singular values of $A$ are arranged in descending order ($\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r$, where $r = \mathrm{rank}(A)$) to form the $m \times n$ diagonal matrix $\Sigma$.

Suppose the first $k$ left singular vectors in $U$ compose the matrix $U_k$ ($m \times k$), the first $k$ right singular vectors in $V$ compose the matrix $V_k$ ($n \times k$), and the $k$ largest singular values compose the diagonal matrix $\Sigma_k$ ($k \times k$). The matrix $A_k$ is defined as $A_k = U_k \Sigma_k V_k^T$.

The rank of $A_k$ is $k$, and $A_k$ can be regarded as the rank-$k$ approximation of $A$.

The decomposition is illustrated in Fig. 1:

[Figure 1: $A_k$ ($m \times n$) $= U_k$ ($m \times k$) $\cdot\ \Sigma_k$ ($k \times k$) $\cdot\ V_k^T$ ($k \times n$)]
Figure 1. Reduce matrix's rank using singular value decomposition.

Reducing the rank of the matrix using SVD removes much of the noise, but if the rank is too small, important information is lost. How large the rank should be is usually decided by experience and experiment for each database.

D. Similarity metric
We use the cosine distance to measure the distance between the semantic vector $q$ of the query image and the semantic vectors of the images in the database. If the semantic-image matrix $A$ has column vectors $a_j$, $j = 1, 2, \ldots, d$, the distance is:

$$D(a_j, q) = 1 - \cos(a_j, q) = 1 - \frac{a_j^T q}{\|a_j\| \, \|q\|} \tag{3}$$

IV. LOW-LEVEL FEATURES MAPPING INTO HIGH-LEVEL SEMANTICS

The difficulty of using LSI in image retrieval is how to use the low-level features of an image in place of the words (terms) of text retrieval. From Section II, the color histogram describes how frequently each color appears in the image, just as word frequency does in text retrieval. We can therefore apply LSI to the color histogram-image matrix to implement semantic-based image retrieval.

Suppose there are $M$ images $I_1, I_2, \ldots, I_M$ in the image database. The algorithm is as follows:

(1) Calculate the color histograms of all images to form the matrix $A_{N \times M}$, where $N$ is the dimension of the color histogram.
(2) Normalize the matrix $A_{N \times M}$ after transposing it.
(3) Weight the result of (2).
(4) Apply SVD to the result of (3), set an appropriate $k$ to discard redundant data and noise, and compose a new rank-$k$ matrix.
(5) Retrieve images using the cosine distance.
(6) Apply steps (1)-(5) to the color autocorrelogram as well.

V. EXPERIMENTAL RESULTS AND DISCUSSIONS

In the experiment, we use 1345 gastroscopic images, of which 169 show cancer and the others do not. By analysis, we find that the intuitive differences between images with cancer and images without cancer are their color and the spatial distribution of color. Therefore, we apply LSI to the color histogram and the color autocorrelogram for retrieval. We built a prototype system and designed the following experiments on it:
(1) Retrieval based on the color histogram.
(2) Retrieval using normalization, based on (1).
(3) Retrieval using term weighting, based on (2).
(4) Retrieval using singular value decomposition, based on (3), trying different k to compare retrieval performance.
(5) Comparison of the above results with the results of retrieval based on the color autocorrelogram.

The system returns the 15 images most similar to the query image for every query. In the experiment, two images are judged similar if they have the same focus region.

For system evaluation, we use the retrieval precision and two ranking measures (average-r, average-p).

The retrieval result interface is shown in Fig. 2.

Figure 2. One retrieval result interface.

In the experiment, we choose 30 images with cancer from the database as query images, retrieve with each of the above techniques, and finally calculate the average values.
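Steps (2)-(5) of the algorithm in Section IV can be sketched with NumPy as follows. This is a minimal illustration and not the authors' implementation. All function names are hypothetical, features are stored as rows and images as columns, values beyond three standard deviations are clipped to the interval ends (the paper does not say how they are handled), and the query is assumed to be one of the database images, as in the experiments.

```python
import numpy as np


def gaussian_normalize(A):
    """Normalize each feature (row of A) over all images to [0, 1] (3-sigma rule)."""
    mu = A.mean(axis=1, keepdims=True)
    sigma = A.std(axis=1, keepdims=True)
    sigma[sigma == 0] = 1.0            # constant features: avoid division by zero
    z = (A - mu) / (3.0 * sigma)       # roughly in [-1, 1] for Gaussian data
    # Assumption: values beyond 3 sigma are clipped to the interval ends.
    return np.clip((z + 1.0) / 2.0, 0.0, 1.0)


def log_entropy_weight(F):
    """Return l_ij * g_i with l_ij = log(1 + f_ij) and the entropy weight g_i.

    Assumes at least two images (columns), so that log M is non-zero.
    """
    n_images = F.shape[1]
    totals = F.sum(axis=1, keepdims=True)            # F_i
    p = np.divide(F, totals, out=np.zeros_like(F), where=totals > 0)
    # Convention: 0 * log 0 = 0.
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    g = 1.0 + plogp.sum(axis=1, keepdims=True) / np.log(n_images)
    return np.log1p(F) * g


def rank_k_approximation(A, k):
    """Return A_k = U_k S_k V_k^T built from the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


def cosine_distances(A, query_column):
    """D(a_j, q) = 1 - a_j^T q / (||a_j|| ||q||) for every column a_j of A."""
    q = A[:, query_column]
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(q)
    cosines = np.divide(q @ A, norms, out=np.zeros(A.shape[1]), where=norms > 0)
    return 1.0 - cosines


def retrieve(features, query_column, k, top=15):
    """features: (N features) x (M images). Returns indices of the closest images."""
    A = log_entropy_weight(gaussian_normalize(np.asarray(features, dtype=float)))
    A_k = rank_k_approximation(A, k)
    return np.argsort(cosine_distances(A_k, query_column))[:top]
```

For example, `retrieve(histograms, query_column=0, k=30)` would return the 15 database images closest to image 0, where `histograms` is a hypothetical 256 x M matrix of color histograms. The query itself is normally ranked first because its distance to itself is zero.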
When retrieving with SVD on the color histogram after normalization and weighting, the average retrieval performance for different k is shown in Tab. I. Tab. II gives the same results for the color autocorrelogram.

TABLE I. RETRIEVAL RESULTS OF COLOR HISTOGRAM USING LSI WITH DIFFERENT k

k           5       10      15      20      25      30      35      40      45
Precision   75.99%  82.88%  83.55%  82.00%  83.33%  83.99%  83.32%  83.77%  83.11%
Average-r   11.14   9.20    7.85    7.50    7.32    7.31    7.38    7.27    7.31
Average-p   0.810   0.884   0.886   0.895   0.899   0.900   0.898   0.900   0.899

k           50      60      80      100     120     140     160     200     256
Precision   82.66%  83.10%  83.11%  83.11%  83.11%  83.11%  83.11%  83.11%  83.11%
Average-r   7.37    7.39    7.45    7.47    7.46    7.46    7.46    7.46    7.46
Average-p   0.896   0.896   0.894   0.894   0.894   0.894   0.894   0.894   0.894

TABLE II. RETRIEVAL RESULTS OF COLOR AUTOCORRELOGRAM USING LSI WITH DIFFERENT k

k           5       10      15      20      25      30      35      40      45
Precision   58.90%  83.11%  75.77%  73.78%  70.89%  68.23%  67.11%  67.11%  68.89%
Average-r   11.24   6.40    7.14    7.55    7.92    8.12    8.51    8.63    8.41
Average-p   0.741   0.919   0.880   0.859   0.837   0.821   0.809   0.802   0.808

k           50      60      80      100     120     140     160     200     256
Precision   68.45%  68.00%  67.34%  67.77%  67.12%  67.12%  67.12%  67.12%  67.12%
Average-r   8.59    8.48    8.48    8.59    8.67    8.72    8.69    8.69    8.69
Average-p   0.805   0.805   0.804   0.794   0.793   0.793   0.795   0.795   0.795

From Tab. I and Tab. II, we obtain the average precision as a function of k for the color histogram and the color autocorrelogram using LSI (Fig. 3):

[Figure 3: plot of precision (0 to 1) against dimension k for two series, color histogram and color autocorrelogram]
Figure 3. Precision of color histogram using LSI and color autocorrelogram using LSI with different k.

We can observe from Tab. I, Tab. II and Fig. 3 that the choice of k has a certain effect on the retrieval results.

For the color histogram using LSI, the average precision reaches its maximum of 83.99% when k is 30, and when k is greater than 100 the retrieval performance no longer changes. A possible explanation is that the singular values beyond the 100th are all redundant data and can be compressed away, while the data between the 30th and the 100th singular values contain some noise, which makes the retrieval performance fluctuate. For the color autocorrelogram using LSI, the average precision reaches its maximum of 83.11% when k is 10, and when k is greater than 140 the retrieval performance no longer changes. By comparison, the two features reach their maximum precision at different k (30 for the former, 10 for the latter), but the maximum precisions are close (83.99% for the former, 83.11% for the latter).
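The best k for each feature can be read off the precision rows of Tables I and II. A small sketch, assuming the first column of both tables corresponds to k = 5 (that label is missing from the extracted tables and is inferred from the text, which places 83.99% at k = 30 and 83.11% at k = 10); the helper `best_k` is hypothetical:

```python
# Precision (%) transcribed from the first halves of Tables I and II.
# Assumption: the columns correspond to k = 5, 10, ..., 45.
ks = [5, 10, 15, 20, 25, 30, 35, 40, 45]
histogram = [75.99, 82.88, 83.55, 82.00, 83.33, 83.99, 83.32, 83.77, 83.11]
autocorrelogram = [58.90, 83.11, 75.77, 73.78, 70.89, 68.23, 67.11, 67.11, 68.89]


def best_k(ks, precision):
    """Return (k, precision) for the entry with the highest precision."""
    i = max(range(len(ks)), key=precision.__getitem__)
    return ks[i], precision[i]


print(best_k(ks, histogram))        # (30, 83.99)
print(best_k(ks, autocorrelogram))  # (10, 83.11)
```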
All experimental results are shown in Tab. III and Fig. 4:

TABLE III. STATISTICS OF EXPERIMENT RESULTS

Feature                 Processing                          Precision   Average-r   Average-p
Color histogram         Raw data                            56.23%      12.86       0.689
Color histogram         Normalized                          70.21%      8.01        0.849
Color histogram         Normalized, Weighted                83.11%      7.46        0.894
Color histogram         Normalized, Weighted, SVD (k=30)    83.99%      7.31        0.900
Color autocorrelogram   Raw data                            56.67%      9.74        0.726
Color autocorrelogram   Normalized                          47.78%      12.34       0.635
Color autocorrelogram   Normalized, Weighted                67.12%      8.69        0.795
Color autocorrelogram   Normalized, Weighted, SVD (k=10)    83.11%      6.40        0.919

[Figure 4: chart of precision (0 to 1) for the color histogram and the color autocorrelogram, with four series: Raw data; Normalized; Normalized, Weighted; Normalized, Weighted with SVD]
Figure 4. Precision contrast of retrieval results.

We can observe from Tab. III and Fig. 4 that, for the color histogram, normalization and term weighting contribute substantially to retrieval performance, while the additional gain from singular value decomposition is small. For the color autocorrelogram, normalization makes retrieval performance drop considerably, but term weighting and SVD improve it substantially. Fig. 4 also shows that, for both the color histogram and the color autocorrelogram, the final retrieval results after normalization, term weighting and SVD approach similar values. A possible explanation is that, for these two low-level features, the retrieval performance achievable with LSI tends to a fixed limit. In other words, LSI cannot improve retrieval performance indefinitely; it is constrained by the images themselves and by other factors. It is nevertheless clear that LSI greatly improves retrieval performance, in precision as well as in the other two measures. LSI is thus able to connect interrelated semantic index items automatically. For gastroscopic images, LSI links the semantic index items that include a cancer focus with each other, and likewise those that do not, so the retrieval performance is improved.

VI. CONCLUSIONS

Focusing on gastroscopic images, this paper uses LSI to implement image retrieval based on semantic information. Experimental results on the prototype system show that the proposed approach can greatly improve retrieval performance. This improvement has a limit, however. Breaking through this limit requires introducing other retrieval mechanisms and techniques, which will be the subject of our future research.

REFERENCES
[1] Mustafa O, Ediz P. A color image segmentation approach for content-based image retrieval. Pattern Recognition, 2007, 40(4): 1318-1325.
[2] Naoto K, Yasuo M. Database retrieval for similar images using ICA and PCA bases. Engineering Applications of Artificial Intelligence, 2005, 18(6): 705-717.
[3] Fang YCZ, Bang TM, Chuan KS. Endoscope diagnosis and differential diagnosis map. LiaoNing Science and Technology Publishing House, 2003.7.
[4] Adam W, Peter Y. Content-based image retrieval using joint correlograms. Multimedia Tools and Applications, 2007, 34(2): 239-248.
[5] Zhao R, Grosky W I. Negotiating the semantic gap: From feature maps to semantic landscapes. Pattern Recognition, 2002, 35: 593-600.
[6] Tai XY, Bei YE. Introduction to information retrieval technology. BeiJing: Science Press, 2006.
