
ROTATIONAL INVARIANT OBJECT DETECTION WITH YOLOv3

Phanindra Dheeraj Varma1, Bharat Giddwani1, Mohana Murali Dasari2, Dr. Gorthi R. K. Sai Subrahmanyam2

1 National Institute of Technology, Raipur, CG-492010, India
{dheeraj.lucifer99, bharatgiddwani}@gmail.com
2 Indian Institute of Technology, Tirupati, A.P.-517506, India
dmmiitkgp@gmail.com, rkg@iittp.ac.in

Abstract. Efficient and accurate object detection has been an important goal in the advancement of computer vision systems. With the advent of deep learning techniques, the accuracy of object detection has increased drastically. Robust and fast detection of multiple objects in an image is now possible using algorithms like Faster-RCNN [1], SSD [2] and YOLO [3]. These modern, efficient and fast techniques detect objects reliably when the input image has the same orientation as the training images, but they are not suitable for images or objects whose orientation deviates from that of the training images. For such cases, a preprocessing method based on principal components is proposed, which aligns the image according to the orientation of the eigenvectors of the bright-pixel distribution obtained after segmentation. This preprocessing step outputs four orientations of the image, on which YOLOv3 is applied to achieve rotation invariant detection of objects. Finally, the principal image, the one whose orientation is closest to that of the training images, is selected based on a decision criterion. This proposed collaboration is named Orientation Corrected Network with YOLOv3 (OCN-YOLOv3). The method requires neither changes in the network architecture nor data augmentation. It is inspired by the orthogonal transformation of principal components [4] and estimates the orientation of the image approximately. The application of the proposed method is demonstrated on the PASCAL VOC 2012 dataset [5], and it is observed that the method, using pre-trained YOLOv3, performs rotation invariant image parsing much more effectively than the base YOLOv3.

Keywords: Principal Components, Eigenvectors, Rotation Invariant, OCN-YOLOv3.

1 Introduction

In deep learning, a Convolutional Neural Network (CNN or ConvNet) is a class of deep, feed-forward neural networks used mostly in the analysis of visual imagery. One of the advantages of CNNs is the translation equivariance provided by weight sharing. However, CNNs cannot deal with rotations of the input image. Though the features extracted by a CNN are invariant to small changes in shift and scale, they are sensitive to rotations in the input image. To overcome this problem, the classical approach is to increase the training data size by including rotated versions of the original images. This is referred to as data augmentation [6][7]. It also increases training time roughly by the number of rotations introduced in the training data.

For images containing single objects, a few other works such as Spatial Transformer Networks (STN) [8], TI-pooling [9], Oriented Response Networks (ORN) [10] with Active Rotating Filters (ARF), RIFD-CNN [11], etc. have also been proposed. These techniques either use data augmentation in the training phase or modify the architecture or the training process, which increases training complexity. For very large datasets with millions of training images, such as that provided in ILSVRC [12], data augmentation is practically not feasible.

Fast rotation invariant object detection [13] with gradient based detection models introduces the concept of training multiple replicas of specific models, each at a different orientation. At test time, a rotation map is formed which contains the orientation information at each location, obtained from the dominant orientation. The dominant orientation is calculated using the SURF algorithm on Haar features. Based on the dominant orientation, only particular models among all the trained models are evaluated. Though the authors claim high speed at evaluation time, the training complexity is similar to that of data augmentation, as different orientations are used to train each model.

2 YOLOv3

YOLO, which stands for You Only Look Once, is one of the faster object detection algorithms available; it uses features learned by a deep convolutional neural network to detect objects. YOLO uses only convolutional layers, making it a fully convolutional network (FCN). Though it is no longer the most accurate object detection algorithm, it is a very good choice when real-time detection is needed without compromising much on accuracy with respect to the state of the art. Later, YOLO9000 [14] came into existence. For its time, it was the fastest and also one of the most accurate algorithms. However, with the advent of algorithms like RetinaNet and SSD, YOLO9000 was no longer the fastest. Some of that speed has been traded for an increase in accuracy in YOLOv3 [15]. While the earlier variant ran at 45 FPS on a Titan X, the current version clocks about 30 FPS. This is due to the increased complexity of the underlying architecture, called Darknet-53.

3 Need of Rotational Invariance

Not all pictures in the world share the same orientation. They differ from each other in alignment because the camera may rotate while capturing them. Moreover, some scenes are best captured in landscape mode, with the camera's view placed horizontally, while others are captured in portrait mode, in which the camera's view is vertical. If all these images are to be fed to a single detector, they must be brought into a single orientation, as CNNs are unable to disentangle planar rotations of the image. For detection to be accurate on rotated images, CNNs should be invariant to these planar rotations, so that they produce the same detections on a rotated image as on the original image.

3.1. Behavior of YOLOv3 for rotated images

YOLOv3 is an extremely fast object detection algorithm. It consists of a network called Darknet-53, trained on images at a particular orientation. Consequently, it can detect objects only at that orientation. If rotated images are fed to it, it makes incorrect predictions. The performance of YOLOv3 on original and rotated images is shown in Fig.1.

Fig.1. YOLOv3 Performance on original & rotated images.

Thus, we observe that YOLOv3 fails at object detection when rotated images are passed to it, as it is trained only on images at a certain orientation. To tackle object detection on rotated versions of a scene, we would need to train the algorithm on rotated versions of the training images, which is an extremely large task as it involves not only rotating the images but also collecting ground truths for every image. The training time would also increase drastically due to the enormous training data size. Thus, there is a need for an efficient method that can bring rotational invariance to CNNs.

4 Proposed method: Orientation Corrected Network with YOLOv3 (OCN-YOLOv3)

YOLOv3 makes very accurate predictions only when the image is in the training orientation. Therefore, there is a need to provide rotational invariance to CNNs for efficient object detection. OCN-YOLOv3 is proposed, which is based on approximately estimating the angle of rotation using principal components [16]. The proposed method is combined with YOLOv3 so that it can orient the input image close to the orientation of the trained images and hence make accurate predictions.
It is observed from the behaviour of YOLOv3 on rotated images that it gives a high objectness score for images oriented at an angle close to that of the training images. When rotated images are passed through YOLOv3, it makes incorrect detections, and even when it detects an object correctly, its objectness score is low compared to that of the properly oriented image. Nearly oriented images have a high score for each object, whereas images with large deviations have misclassified objects or correctly classified objects with lower scores.
4.1. Training phase
Training of OCN-YOLOv3 involves no data augmentation or rotation of images. It is trained in the usual way, with only one particular orientation. No modifications are made to the architecture; thus, if a pre-trained model is available, training can be avoided entirely.
4.2. Testing phase
The proposed method is implemented during testing. The input image is converted to grayscale, and Otsu's thresholding is applied to convert it into a black & white image. The locations of the bright pixels in the segmented image are then collected into a matrix, and the covariance matrix of these coordinates is calculated. The eigenvectors of the image are determined from this covariance matrix. This yields two eigenvalues and their corresponding eigenvectors, which are called the principal components. The eigenvector corresponding to the largest eigenvalue is the first principal component. The method explained above is depicted in a flowchart in Fig.2.

Fig.2. Preprocessing steps with YOLOv3.
Fig.3. Test phase of rotation invariant detection collaborating with YOLOv3.
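To make the preprocessing steps above concrete, the following is a minimal sketch assuming OpenCV and NumPy; the function and variable names are illustrative and not taken from the authors' implementation.

```python
import cv2
import numpy as np

def principal_components(image_bgr):
    """Estimate the principal components of the bright-pixel distribution.

    image_bgr is assumed to be an 8-bit BGR image.
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # Otsu's thresholding segments the image into a binary (black & white) map.
    _, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Collect the (x, y) coordinates of the bright pixels.
    ys, xs = np.nonzero(bw)
    coords = np.stack([xs, ys], axis=0).astype(np.float64)  # shape (2, N)
    cov = np.cov(coords)                                     # 2x2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)                   # eigenvalues in ascending order
    # The eigenvector of the largest eigenvalue is the first principal component.
    first_pc = eigvecs[:, np.argmax(eigvals)]
    return eigvals, eigvecs, first_pc
```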

By observation, we note the following points:

• The direction of the first principal component of images of a certain class is nearly the same [16].
• The first principal component is nearly horizontal for original images with aspect ratio width > height, and nearly vertical for images with width < height.
• The eigenvectors of an image and of its 180° rotated version point in the same direction, so the estimated principal components may correspond to either of these two versions of the image.
After estimating the principal components of the input image, we apply the orthogonal transformation [4][17] to them, which brings them into exactly horizontal and vertical positions. We consider these as reference vectors. By calculating the angle between the first principal component of the input image and these reference vectors, we obtain two candidate angles of rotation, say α and β. Then, taking the above three observations into account, we rotate the input image by α, 180° + α, β and 180° + β. This is the end of the preprocessing method.
Hence, we obtain four images at the end of the preprocessing step. These four images are fed to YOLOv3 to obtain detections, and the principal image among them is selected based on a decision criterion, which is discussed in the next section. Fig.3 demonstrates the flow of the process explained above.
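As a rough illustration of how the four candidate orientations could be generated, the sketch below reuses the principal_components helper from the previous snippet; the sign and angle conventions are assumptions and may differ from the authors' implementation.

```python
import cv2
import numpy as np

def four_orientations(image_bgr):
    """Return the four candidate rotation angles and the rotated images."""
    _, _, pc = principal_components(image_bgr)
    # Angle between the first principal component and the horizontal reference vector.
    alpha = np.degrees(np.arctan2(pc[1], pc[0]))
    # The angle to the vertical reference vector differs by 90 degrees.
    beta = alpha - 90.0
    angles = [alpha, alpha + 180.0, beta, beta + 180.0]
    h, w = image_bgr.shape[:2]
    rotated = []
    for ang in angles:
        # Rotate about the image centre; borders are simply zero-padded here.
        m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), ang, 1.0)
        rotated.append(cv2.warpAffine(image_bgr, m, (w, h)))
    return angles, rotated
```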

Fig.4. Decision Logic for rotation invariant image annotation

4.3 Decision Criteria

Based on the behaviour of YOLOv3 on rotated images, a decision criterion based on class frequency and objectness score is developed for selecting, among the four images, the principal image that gives the correct annotation of the input image. Motivated by the work in [18], multiple orientations of the same image are fed to YOLOv3.

Observations from the detections are as follows:

• The detections are accurate in the image whose orientation is closest to that of the training images. So the principal image has good predictions and higher scores.
• An object that is missed in the principal image might be detected in any of the other three orientations, but the score of that detection will be lower than that of at least one detection in the principal image.
• The objects present in the principal image can also be identified in the other orientations, with low scores. Thus, true detections appear at least as often as false detections.

Thus, we can count the number of times an object is detected across the four orientations, which gives its likelihood. The score of the object can then be read off in each orientation. Even if the objects in the principal image are not detected in the other three, one of them will have the highest score among all annotations produced on the four orientations. So, the annotation with maximum likelihood or objectness score forms the basis for selecting the principal image. As shown in Fig.4, the bicycle is detected in all four orientations; thus, it is the class with maximum frequency. The objectness score of the bicycle is found to be highest in the third rotated image, so that image is the principal image, since it contains the class with maximum likelihood. The method also works for finer rotations of the image, not only for some discrete rotations. As shown in Table 1, the input image is a randomly rotated version. By applying OCN-YOLOv3, the final four orientations with predictions are obtained. Among those, cow and horse are the classes with maximum frequency of occurrence. So we look for the maximum objectness score among the two classes to identify the principal image. Hence, the second image is the principal image, based on maximum likelihood and the prediction score of the respective class.

In brief, we find the class which occurs the maximum number of times in the annotations produced on all four images. We then find the scores of that particular class in every image in which it is detected. The maximum of those scores is selected, and the image having the annotation with that score is the principal image. If no class is repeated, then the annotation with the highest score among all detections on the four images is selected, and the respective image is the principal image. If two or more classes have the maximum likelihood, we select the one which produces the maximum score [18].
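A minimal sketch of this decision criterion follows. It assumes that the detections of each orientation are available as (class, score) pairs; the helper name and data layout are hypothetical.

```python
from collections import Counter

def select_principal_image(detections_per_image):
    """detections_per_image: list of four lists of (class_name, score) pairs,
    one list per orientation. Returns the index of the principal image."""
    # Class frequency: in how many orientations each class is detected.
    freq = Counter()
    for dets in detections_per_image:
        for cls in {c for c, _ in dets}:
            freq[cls] += 1
    if not freq:
        return None
    max_freq = max(freq.values())
    frequent = {c for c, n in freq.items() if n == max_freq}
    # Among the most frequent classes, take the single highest score; the
    # orientation that produced it is selected as the principal image.
    best_idx, best_score = None, -1.0
    for idx, dets in enumerate(detections_per_image):
        for cls, score in dets:
            if cls in frequent and score > best_score:
                best_idx, best_score = idx, score
    return best_idx
```

Note that when no class is repeated, max_freq is 1 and every detected class is "frequent", so the sketch falls back to the single highest-scoring annotation, matching the fallback described above.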

5 Simulation Results

The dataset provided by the PASCAL VOC 2012 challenge (consisting of 20 classes) has been used for the test analysis. YOLOv3 is pre-trained on the MS COCO dataset with 80 classes. The test images are rotated by 0°, 90°, 180° and 270° to create the rotated test data. The original image is rotated by about 137° to obtain the input image shown in Table 1. OCN-YOLOv3 and YOLOv3 are compared in terms of accuracy. The accuracy of a class is calculated by summing the detection scores and dividing the sum by the number of images of that class. The comparison is shown in Table 2. The angles at the top of Table 2 are the rotation angles through which the original (well-oriented) images were rotated for experimental purposes. The proposed method is then applied to those rotated images and the results are obtained. The results are also accurate for finer rotations, as shown in Table 1. The mean accuracies of Table 2 are represented in a column chart in Fig.5.
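As a sketch, the per-class accuracy described above can be computed as follows; expressing the averaged detection scores as a percentage is an assumption based on the scale of the values in Table 2.

```python
def class_accuracy(detection_scores):
    """detection_scores: one detection score (in [0, 1]) per test image of the class."""
    return 100.0 * sum(detection_scores) / len(detection_scores)
```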
Fig.5. Column chart representing the mean accuracies of OCN-YOLOv3 and YOLOv3 at rotation angles of 0°, 90°, 180° and 270°.

Table 1. Decision logic explained for finer rotation invariant image annotation.

Input image: a randomly rotated test image (the original rotated by about 137°).
Rotated images: four orientations at angles Φ1 = α, Φ2 = 180° + α, Φ3 = β and Φ4 = 180° + β.
Detections: Φ1: Horse – 0.9545; Φ2: Cow – 0.9985, Person – 0.9943, Person – 0.9916; Φ3: Cow – 0.8691; Φ4: Horse – 0.7903.
Class frequency: Cow: 2, Horse: 2, Person: 1.
Frequent classes: Cow, Horse.
Max. score of frequent classes: 0.9985 from Φ2 for Cow; 0.9545 from Φ1 for Horse.
Principal image: Φ2 (since it has the maximum score for the frequent class).

Table 2. Comparison between OCN-YOLOv3 and YOLOv3 in terms of accuracy (%).

Class       0° (OCN-YOLOv3 / YOLOv3)   90° (OCN-YOLOv3 / YOLOv3)   180° (OCN-YOLOv3 / YOLOv3)   270° (OCN-YOLOv3 / YOLOv3)
Aeroplane   98.11 / 99.86              98.13 / 64.03               98.11 / 99.86                98.13 / 64.03
Bicycle     97.43 / 99.91              99.60 / 62.86               97.43 / 99.91                99.60 / 62.86
Bird        98.69 / 98.37              98.90 / 65.80               98.69 / 98.37                98.90 / 65.80
Boat        69.47 / 97.09              75.05 / 04.79               69.47 / 97.09                75.05 / 04.79
Bottle      78.69 / 97.84              82.77 / 61.01               78.69 / 97.84                82.77 / 61.01
Bus         94.88 / 99.94              94.86 / 09.02               94.88 / 99.94                94.86 / 09.02
Car         99.84 / 99.27              99.64 / 05.98               99.84 / 99.27                99.64 / 05.98
Cat         94.71 / 95.98              94.65 / 90.45               94.71 / 95.98                94.65 / 90.45
Chair       78.85 / 95.23              75.08 / 17.04               78.85 / 95.23                75.08 / 17.04
Bike        99.81 / 99.94              98.63 / 56.85               99.81 / 99.94                98.63 / 56.85
Mean        91.04 / 98.34              91.73 / 43.78               91.04 / 98.34                91.73 / 43.78

6 Conclusion

Observing the results obtained from OCN-YOLOv3 and YOLOv3, we can say that the proposed method is an elegant and robust detection method which makes accurate predictions despite the sensitivity of CNNs to rotation. The prominent features of the proposed method are training simplicity, cost and computational efficiency, and no architecture modification.

The method also works for finer degrees of rotation in the image acquired by the camera, while passing only four resulting orientations into the network. In the method of incorporating rotational invariance into a CNN using multiple instances of the network [18], the image has to be fed into the network N times, where N depends on the granularity of rotation and can exceed four for much finer rotations. In contrast, the proposed method uses only four instances every time, even for smaller and finer rotations. Thus, the proposed method is simpler and more time-efficient.

7 References
1. Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), June 2017.
2. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: single shot multibox detector. In ECCV, 2016.
3. Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You Only Look Once: Unified, Real-Time Object Detection. arXiv preprint arXiv:1506.02640, 2015. http://pjreddie.com/yolo/
4. Wei Yi and S. Marshall. Principal Component Analysis in application to object orientation. Geo-spatial Information Science, 3(3):76-78, September 2000.
5. Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88(2):303-338, 2010.
6. Fok Hing Chi Tivive and Abdesselam Bouzerdoum. Rotation invariant face detection using convolutional neural networks. In International Conference on Neural Information Processing, pages 260-269. Springer, 2006.
7. Sander Dieleman, Kyle W. Willett, and Joni Dambre. Rotation-invariant convolutional neural networks for galaxy morphology prediction. MNRAS, 450:1441-1459, 2015.
8. Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. Advances in Neural Information Processing Systems, pages 2017-2025, 2015.
9. D. Laptev, N. Savinov, J. M. Buhmann, and M. Pollefeys. TI-pooling: Transformation-invariant pooling for feature learning in convolutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 289-297, June 2016.
10. Yanzhao Zhou, Qixiang Ye, Qiang Qiu, and Jianbin Jiao. Oriented Response Networks. CoRR, abs/1701.01833, 2017.
11. G. Cheng, P. Zhou, and J. Han. RIFD-CNN: Rotation-invariant and Fisher discriminative convolutional neural networks for object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2884-2893, June 2016.
12. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.
13. Floris De Smedt and Toon Goedemé. Fast rotation invariant object detection with gradient based detection models. In Proceedings of the 10th International Conference on Computer Vision Theory and Applications - Volume 2: VISAPP (VISIGRAPP 2015), pages 400-407. INSTICC, SciTePress, 2015.
14. J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6517-6525. IEEE, 2017.
15. Joseph Redmon and Ali Farhadi. YOLOv3: An Incremental Improvement. arXiv, 2018. https://pjreddie.com/publications/
16. Swetha V. C., D. Mishra, and R. K. Subrahmanyam Gorthi. Scale and Rotation Corrected CNNs (SRC-CNNs) for Scale and Rotation Invariant Character Recognition. Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP), 2018.
17. Hawrra Hassan Abass and Firas Mahdi Muhsin Al-Salbi. Rotation and Scaling Image Using PCA. Vol. 5, No. 1, January 2012.
18. Haribabu, Ayushi Jain, Swetha V. C., Deepak M., and Sai Subrahmanyam Gorthi. Incorporating Rotational Invariance in Convolutional Neural Network Architecture. Pattern Analysis and Applications (Springer), pp. 1-14, February 2018.
