
sensors

Article
Deep Count: Fruit Counting Based on Deep
Simulated Learning
Maryam Rahnemoonfar * and Clay Sheppard
Department of Computer Science, Texas A&M University-Corpus Christi, Corpus Christi, TX 78412, USA;
csheppard1@islander.tamucc.edu
* Correspondence: maryam.rahnemoonfar@tamucc.edu

Academic Editor: Vittorio M. N. Passaro


Received: 18 February 2017; Accepted: 7 April 2017; Published: 20 April 2017

Abstract: Recent years have witnessed significant advancement in computer vision research based
on deep learning. Success of these tasks largely depends on the availability of a large amount of
training samples. Labeling the training samples is an expensive process. In this paper, we present
a simulated deep convolutional neural network for yield estimation. Knowing the exact number of
fruits, flowers, and trees helps farmers to make better decisions on cultivation practices, plant disease
prevention, and the size of harvest labor force. The current practice of yield estimation based on the
manual counting of fruits or flowers by workers is a very time-consuming and expensive process
and it is not practical for big fields. Automatic yield estimation based on robotic agriculture provides
a viable solution in this regard. Our network is trained entirely on synthetic data and tested on real
data. To capture features on multiple scales, we used a modified version of the Inception-ResNet
architecture. Our algorithm counts efficiently even if fruits are under shadow, occluded by foliage or
branches, or if there is some degree of overlap amongst fruits. Experimental results show a 91%
average test accuracy on real images and 93% on synthetic images.

Keywords: deep learning; agricultural sensors; simulated learning; yield estimation

1. Introduction
Recent years have witnessed enormous advancement in computer vision research based on
deep learning. A variety of vision-based tasks, such as object recognition [1–6], classification [7–9],
and counting [10–12], can achieve high accuracy. Success of these advanced tasks largely depends on the
availability of a large amount of training samples. Labeling the training samples is an expensive process
both in terms of time and money. Generating synthetic data for training can provide an alternative
solution. One of the objectives of this research is to reduce the overhead of labeling the training
samples for the object counting problem by creating a synthetic dataset for training. This research achieves
a fast yield estimation based on deep simulated learning. Accurate yield prediction helps farmers
to improve their crop quality. Moreover, it helps in reducing the operational cost by making better
decisions on the intensity of crop harvesting and the labor required.
There are various challenges faced by computer vision algorithms when counting fruits for yield
estimation, namely illumination variance, occlusion by foliage, varied degrees of overlap amongst
fruits, fruits under shadow, and scale variation. To address these challenges, in this research
we developed a novel deep learning architecture which counts objects without detecting them.
Our method estimates the number of objects directly from a glance at the entire image. In this
way, it reduces the overhead of object detection and localization. Our network consists of several
convolution and pooling layers in addition to a modified Inception-ResNet. The modified version of the
Inception-ResNet helps us to capture features at multiple scales. The framework of our approach is
depicted in Figure 1.

Sensors 2017, 17, 905; doi:10.3390/s17040905 www.mdpi.com/journal/sensors



Figure 1. The framework of our research.

The main advantage of this work is that thousands of annotated data on real images are not
necessary for training. The network was trained using synthetic images and tested on real images,
and it works efficiently with 91% accuracy on real images. The proposed methodology works
efficiently even if there is illumination variance in the images.
The following are the contributions of this work:
- A novel deep learning architecture for counting fruits based on convolutional neural networks (CNN) and a modified version of Inception-ResNet is presented.
- We developed a simulation-based learning method, which is trained on simulated data but tested on real data.
- Our approach is robust to occlusion and to variation in illumination and scale.
- Our algorithm works in less than a second, which is fast enough to be useful for real-time applications.

2. Related Work
In a typical image classification process, the task is to specify the presence or absence of an object,
but in a counting problem one is required to reason about how many instances of an object are present in the
scene. The counting problem arises in several real-world applications, such as cell counting in microscopic
images [13], wildlife counting in aerial images [14], fish counting [15], and crowd monitoring [16]
in surveillance systems. The method proposed by Kim et al. [17] detects and tracks moving people
with the help of a fixed single camera. Later, Lempitsky et al. [18] proposed a new supervised learning
framework for visual object counting tasks that optimizes a loss based on the MESA-distance during
the learning. Recently, Giuffrida et al. [19] proposed a learning-based approach for counting leaves
in rosette (model) plants. They used a supervised regression model to relate image-based descriptors,
which are learned in an unsupervised fashion, to leaf counts.

The current practice of yield estimation based on the manual counting of fruits or flowers
by workers is a very time-consuming and expensive process and it is not practical for large fields.
Automatic yield estimation based on robotic agriculture provides a viable solution in this regard.
A widely adopted solution for automatic yield estimation is to count fruits or calculate the density
of flowers on images using computer vision algorithms [20–26]. Computer vision-based crop yield
estimation methods can be divided roughly into two categories: (1) region- or area-based methods
and (2) counting-based methods. In the literature, there is an ample amount of work dealing with
region-based methods [20–26]. Wang et al. [20] developed a stereo camera automatic crop yield
estimation system for apple orchards. They captured images at nighttime to reduce the unpredictable
natural illumination in the daytime. Li et al. [21] developed an in-field cotton detection system based
on region-based semantic image segmentation. Lu et al. [22] developed region-based color modeling
for joint crop and maize tassel segmentation. Despite the wide attention to region-based methods, very
scarce attention has been paid to counting-based yield estimation methods [27,28]. Linker et al. [27]
used color images to estimate the number of apples acquired in orchards under natural illumination.
The drawbacks are direct illumination and color saturation, due to which a large number of false
positives were observed. Tabb et al. [28] developed a method to segment apple fruit from video using
background modeling.

Recently, deep learning-based object counting methods are gaining popularity. Seguí et al. [10]
explored the task of counting occurrences of a concept of interest with CNN. Xie et al. [13] developed
a convolutional regression network-based microscopy cell counting framework. Zhang et al. [11]
developed a cross-scene crowd counting framework based on deep convolutional neural networks.
French et al. [29] also explored CNN for counting fish in a fisheries surveillance video. Several authors
explored deep learning approaches for fruit/plant detection and recognition [30–34]. To the best of
our knowledge, there are no papers related to fruit counting based on deep simulated learning; all
of the deep learning-based counting methods rely on object detection and then count the detected
instances. Our method estimates the count of objects directly from a glance at the entire image.
In this way, it reduces the overhead of object detection and localization, and it learns explicitly to count.
Moreover, the aforementioned techniques are dependent on a large set of labeled data. Labeling the
training samples is expensive both in terms of time and money. Here we generate synthetic
data to reduce the overhead of labeling the training samples for object counting problems. Although
trained on synthetic data, our method performs very well on real data.
3. Methodology

3.1. Synthetic Image Generation

Deep learning requires large datasets that are time consuming to collect and annotate. To solve
this issue, we generated synthetic data to train our network. The trained parameters were then tested
on real images. The synthetic images were generated as follows. A blank image of size 128 × 128 pixels
is created, followed by filling the entire image with green and brown colored circles to simulate
the background and the tomato plant, which are later blurred by a Gaussian filter. To create the
variable-sized tomatoes in the image, several circles of random size are drawn at random positions
on the image. Twenty-four thousand images were generated for the training set, and 2400 for the test
set. Figure 2 shows the process of generating the synthetic images that were used to train the network.
Synthetic tomato images were generated with some degree of overlap along with variation in size,
scale, and illumination in order to incorporate the possible complexities in real tomato images.

Figure 2. Synthetic image generation.
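The generation recipe above can be sketched in a few lines of Python with OpenCV. This is only an illustration of the described procedure, not the authors' generator; the circle counts, color values, and blur size below are assumptions chosen to mimic the description.

```python
import cv2
import numpy as np

def synthetic_tomato_image(size=128, max_tomatoes=30, rng=np.random.default_rng()):
    """Sketch of the synthetic-image recipe: green/brown background circles,
    a Gaussian blur, then red circles of random size as 'tomatoes'."""
    img = np.zeros((size, size, 3), np.uint8)

    # Fill the background with green and brown circles (foliage and soil).
    for _ in range(200):
        center = tuple(rng.integers(0, size, 2).tolist())
        radius = int(rng.integers(3, 15))
        color = (20, 120, 20) if rng.random() < 0.7 else (30, 60, 100)  # BGR green / brown
        cv2.circle(img, center, radius, color, -1)
    img = cv2.GaussianBlur(img, (7, 7), 0)  # blur the background only

    # Draw variable-sized red circles at random positions as tomatoes.
    count = int(rng.integers(1, max_tomatoes + 1))
    for _ in range(count):
        center = tuple(rng.integers(0, size, 2).tolist())
        radius = int(rng.integers(4, 12))
        # Slight color jitter to imitate illumination variation.
        red = (int(rng.integers(0, 60)), int(rng.integers(0, 60)), int(rng.integers(180, 255)))
        cv2.circle(img, center, radius, red, -1)

    return img, count  # image and its ground-truth count

# Example: build a small labeled batch of synthetic training images.
images, labels = zip(*(synthetic_tomato_image() for _ in range(8)))
```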

3.2. Convolutional Neural Network


CNN is one of the most notable deep learning approaches; it comprises various convolutional and
pooling (subsampling) layers that resemble the human visual system [35]. Generally, image data is fed to
the CNN, which constitutes an input layer, and a vector of reasonably distinct features associated with
object classes is produced in the form of an output layer. Between the input and output layers there are
hidden layers in the form of a series of convolution and pooling layers followed by fully-connected
layers [36]. The training of the network is performed in forward and backward stages based on the
prediction output and the labeled ground-truth. In the backpropagation stage, the gradient of each
parameter is computed based on the loss cost. All of the parameters are updated based on the gradients
and are used for the next forward computation. The network learning can be stopped after sufficient
iterations of forward and backward stages.

3.3. Inception Architecture

While a convolutional layer attempts to learn to filter simultaneously across two spatial
dimensions and a channel dimension in a 3D space, the Inception model makes this process easier
and, therefore, it empirically appears to be capable of learning richer representations with fewer
parameters. The Inception model independently looks at cross-channel correlations and at
spatial correlations. The Inception architecture, introduced by Szegedy et al. (Inception-v1) [37] and later
refined as Inception-v2 [38], Inception-v3 [39] and, most recently, Inception-ResNet [6], has been
one of the best-performing families of models on the ImageNet dataset [40]. Inspired by the success
of these models in the ImageNet competition, we incorporated a modified version of Inception-ResNet-A
in our CNN network.

3.4. Description of Our Network Architecture

The neural network design for this research is shown in Figure 3. The first layer of the network
is a 7 × 7 convolution layer followed by a 3 × 3 max pooling layer with stride 2. This convolutional layer
maps the 3 bands (RGB) in the input image to 64 feature maps using a 7 × 7 kernel. This
condenses the information in the network. Reducing the dimensions of the image reduces computation
time and allows the model to fit into the GPU's memory [41]. Similarly, in order to reduce the
dimensionality of the feature maps, a 1 × 1 convolution layer is used before another convolution
layer of kernel size 5 × 5.

Figure 3. The architecture of our network.
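For concreteness, this front end can be sketched in Keras-style TensorFlow code (the authors used TensorFlow [46], though not necessarily this API). Only the 7 × 7/64 stem, the 3 × 3 stride-2 pooling, and the 1 × 1 then 5 × 5 ordering follow the description; the remaining filter counts are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_relu(x, filters, kernel_size, strides=1, padding="same"):
    """Convolution followed by batch normalization and ReLU, since batch
    normalization is applied after every convolution in this network."""
    x = layers.Conv2D(filters, kernel_size, strides=strides, padding=padding, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

# Stem: 128x128 RGB input -> 7x7 conv to 64 maps -> 3x3 max pool (stride 2)
# -> 1x1 conv (dimensionality reduction) -> 5x5 conv, per Section 3.4.
inputs = layers.Input(shape=(128, 128, 3))
x = conv_bn_relu(inputs, 64, 7)                       # 7x7, 64 feature maps
x = layers.MaxPooling2D(pool_size=3, strides=2, padding="same")(x)
x = conv_bn_relu(x, 64, 1)                            # 1x1 bottleneck (64 filters assumed)
x = conv_bn_relu(x, 192, 5)                           # 5x5 conv (192 filters assumed)

stem = tf.keras.Model(inputs, x, name="deep_count_stem")
```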
The size of the objects in the images varies, so an architecture that can capture features at multiple
scales is required. For this purpose, we modified the Inception-ResNet-A [6] layer. Two modified
Inception-ResNet-A layers follow the normal convolutional layers. Inception-ResNet combines
the ideas of Inception, which captures features at multiple sizes by concatenating the results of
convolutional layers with different kernel sizes, and residual networks [3], which use skip connections
to create a simple path for information to flow throughout a neural network. This architecture was
used because of its high performance on several competitive image recognition challenges [6]. Residual
networks converge faster because residual connections speed up training in deep networks [3]. Figure 4
shows the design of the modified Inception-ResNet-A module that is used in this work. The final
1 × 1 convolution only calculates 192 features, compared to 256 in the original Inception-ResNet-A [6].

Figure 4. Modified Inception-ResNet-A module.

As can be seen in Figure 4, the modified Inception-ResNet-A module consists of three parallel
layers concatenated into one. The result of this concatenation is then added to the activations of the
previous layer and passed through the rectified linear function. After the modified
Inception-ResNet-A layers, a modified Inception reduction module, shown in Figure 5, is used to
simultaneously reduce the image size and expand the number of filters. As can be seen in Figure 5,
three parallel branches are concatenated into one output. These branches include maximum pooling
and strided convolutions without padding (stride 2 V in Figure 5). The middle branch of the module
was reduced from an output size of 192 to 128. The right branch of the module was reduced to 192,
128, 128, and 128 output sizes from 256, 256, 320, and 320, respectively. These changes were made in
order to fit the network more closely to the complexity of the problem. Before these modifications,
the model tended to overfit to the training data and performed poorly on real data.

After the modified reduction module, another set of two Inception-ResNet-A layers were
applied, followed by 3 × 3 average pooling because average pooling has been found to improve the
accuracy when used before the final fully connected layer [42]. As can be seen in Figure 3, the size of
the final fully connected layer is 768. Although deep neural nets with a large number of parameters
are very powerful machine learning systems, overfitting is a serious problem in such networks.
Large networks are also slow to use, making it difficult to deal with overfitting by combining the
predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem.
The key idea is to randomly drop units (along with their connections) from the neural network during
training [43]. Sixty-five percent of connections were randomly kept while training the network. Finally,
the last fully-connected layer after the dropout layer gives the prediction for the number of tomatoes
in the input image. Batch normalization was performed after every convolution to remove the internal
covariate shift [38].

Figure 5. Modified reduction module.
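Read literally, the two modules described above suggest something like the following Keras-style sketch: a residual block whose three parallel branches are concatenated, projected to 192 features by a 1 × 1 convolution, added to the block input, and passed through a ReLU; and a reduction module with max pooling, a 128-filter stride-2 "V" (valid) branch, and a 192/128/128/128 branch. Branch depths and the unstated filter counts are assumptions, so this is an illustration rather than the authors' exact module.

```python
from tensorflow.keras import layers

def modified_inception_resnet_a(x):
    """Block in the spirit of Figure 4: three parallel branches, concatenation,
    1x1 projection to 192 features (instead of 256), residual addition, ReLU.
    x is assumed to carry 192 channels so the addition is shape-compatible."""
    b1 = layers.Conv2D(32, 1, padding="same", activation="relu")(x)

    b2 = layers.Conv2D(32, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(32, 3, padding="same", activation="relu")(b2)

    b3 = layers.Conv2D(32, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(48, 3, padding="same", activation="relu")(b3)
    b3 = layers.Conv2D(64, 3, padding="same", activation="relu")(b3)

    merged = layers.Concatenate()([b1, b2, b3])
    merged = layers.Conv2D(192, 1, padding="same")(merged)  # 192 features, not 256
    out = layers.Add()([x, merged])                          # skip connection
    return layers.ReLU()(out)

def modified_reduction(x):
    """Module in the spirit of Figure 5: parallel max pooling and stride-2
    'V' (valid) convolutions, concatenated. The 128 and 192/128/128/128
    output sizes follow the text; everything else is assumed."""
    b1 = layers.MaxPooling2D(pool_size=3, strides=2, padding="valid")(x)

    b2 = layers.Conv2D(128, 3, strides=2, padding="valid", activation="relu")(x)   # middle branch: 192 -> 128

    b3 = layers.Conv2D(192, 1, padding="same", activation="relu")(x)               # right branch: 256 -> 192
    b3 = layers.Conv2D(128, 3, padding="same", activation="relu")(b3)
    b3 = layers.Conv2D(128, 3, padding="same", activation="relu")(b3)
    b3 = layers.Conv2D(128, 3, strides=2, padding="valid", activation="relu")(b3)

    return layers.Concatenate()([b1, b2, b3])
```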

3.5. Training Methodology

The network was trained for three epochs on 24,000 synthetic images. To minimize the error,
an Adam optimizer is used [44]. It is an algorithm for first-order gradient-based optimization of
stochastic objective functions, based on adaptive estimates of lower-order moments. Inspired by
two popular optimization methods, namely AdaGrad and RMSProp, Kingma and Ba [44] came up
with a new optimizer that can deal with sparse gradients as well as non-stationary objectives.
The advantages of using the Adam optimizer include being computationally efficient, low memory
requirements, invariance to diagonal rescaling of the gradients, and being well-suited for problems
that are large in terms of data or parameters. The learning rate for the Adam optimizer was set at
a constant 1 × 10⁻³. The mean squared error was used as the cost function. The network was evaluated
using the exponential moving averages of weights. Weights were initialized using a Xavier initializer [45].
Xavier initialization ensures the weights are appropriate by keeping the signal in a reasonable range
of values throughout the layers. It tries to keep the variance of the input gradient and the output
gradient the same, which helps to keep the scale of the gradients approximately the same throughout
the network.

4. Experimental Results

The network was implemented using TensorFlow [46] running on an NVidia 980Ti GPU.
For training, 24,000 synthetic images were used. For testing, a different set of 2400 synthetic images
and 100 randomly-selected real tomato images from Google Images were used. The size of the synthetic
images is 128 × 128 pixels. Real images have different sizes, but all were resized to 128 × 128 pixels.

4.1. Experimental Results with Synthetic Data

Validation on 2400 synthetic images gives a mean squared error for the count of about 1.16.
Figure 6 shows the mean square error for training, where the abscissa represents the number of steps
and the ordinate represents the mean square error. The network was trained with three different
dropout values (50%, 65%, and 80%) to find the lowest value for the mean square error, and 65% was
chosen as the dropout value for the network. Figure 6 shows the mean square error for a dropout
value of 65%. Looking at the graph in Figure 6, it is clear that the network converges quickly; this is
why the network was trained for only three epochs.

Figure 6. Mean square error for training at a dropout value of 65%.
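Tying the training choices of Section 3.5 together, a schematic Keras-style setup might look as follows: Xavier (Glorot) initialization, a dropout layer that keeps 65% of connections, a 768-unit fully-connected layer, Adam with a constant learning rate of 1 × 10⁻³, and a mean squared error loss on the predicted count. The small feature extractor is only a stand-in for the architecture of Section 3.4, and the commented-out fit call assumes the 24,000 synthetic images are already loaded; none of this is the authors' published code.

```python
import tensorflow as tf
from tensorflow.keras import layers, initializers

# Stand-in feature extractor; in the paper this is the stem plus the modified
# Inception-ResNet-A and reduction modules of Section 3.4.
def tiny_feature_extractor(x):
    x = layers.Conv2D(64, 7, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(3, strides=2, padding="same")(x)
    return layers.Conv2D(192, 3, padding="same", activation="relu")(x)

xavier = initializers.GlorotUniform()               # Xavier initialization [45]
inputs = layers.Input(shape=(128, 128, 3))
x = tiny_feature_extractor(inputs)
x = layers.AveragePooling2D(pool_size=3)(x)         # 3x3 average pooling before the head
x = layers.Flatten()(x)
x = layers.Dense(768, activation="relu", kernel_initializer=xavier)(x)
x = layers.Dropout(0.35)(x)                         # drop 35%, i.e. keep 65% of connections
count = layers.Dense(1, kernel_initializer=xavier)(x)   # predicted number of tomatoes

model = tf.keras.Model(inputs, count)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # constant 1e-3
              loss="mse")                           # mean squared error on the count
# model.fit(train_images, train_counts, epochs=3, batch_size=64)  # three epochs on 24,000 images
```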
4.2. Experimental Results with Real Data

The algorithm was tested over 100 randomly-chosen images; despite not having real images for
training, the network performs well for real images. Table 1 shows twenty representative images
along with their predicted and actual count. In Table 1, column R contains the real images, column P
contains the predicted count, and column GT contains the actual count (ground truth).

Table 1. Real tomato images with predicted (P) and actual count (GT).

Predicted/actual (P/GT) counts for the twenty images:
36/38   27/24   18/17   27/28
22/25   21/23   15/14   12/12
22/22   13/12   14/14   14/13
20/25   19/19   38/39   16/16
22/22   16/17   16/19   24/24
The network was trained to count ripe and half-ripe tomatoes. The algorithm can handle the
variation in illumination, size, shadow, and also images with overlapped and partially-occluded fruits.
For example, the fruits in the second row and fourth column image are partially occluded by leaves and
they are under different illumination conditions. However, the actual count and the predicted count by
our algorithm are exactly the same (=12). Another example is the image in the last row and the last
column, which has overlapped fruits with different sizes and is occluded by foliage; still, the actual
count and predicted count by our algorithm are exactly the same (=24).

4.3. Evaluation
To evaluate the performance of our results, we compared the predicted count of our algorithm
with the actual count. The actual count was attained by taking the average count of three individuals
observing the images independently.
The accuracy was calculated as follows:

pa(%) = (1 - |pc - ac| / ac) × 100    (1)

where pa is the accuracy (%), pc is the predicted count, and ac is the actual count. As can be seen in
Figure 7, the accuracy is between 70% and 100% and the average accuracy for 100 images is equal to
91.03%.

Figure 7. The accuracy for all 100 images.
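Equation (1) and the reported error metrics are simple to compute; the helper below is a direct transcription of those definitions, assuming the predicted and actual counts are available as arrays (the example values are taken from the first row of Table 1).

```python
import numpy as np

def per_image_accuracy(pc, ac):
    """Equation (1): pa(%) = (1 - |pc - ac| / ac) * 100 for each image."""
    pc, ac = np.asarray(pc, float), np.asarray(ac, float)
    return (1.0 - np.abs(pc - ac) / ac) * 100.0

def rmse(pc, ac):
    """Root mean square error between predicted and actual counts."""
    pc, ac = np.asarray(pc, float), np.asarray(ac, float)
    return float(np.sqrt(np.mean((pc - ac) ** 2)))

# Example with the first row of Table 1 (p = predicted, gt = ground truth).
p = [36, 27, 18, 27]
gt = [38, 24, 17, 28]
print(per_image_accuracy(p, gt))                      # per-image accuracies in percent
print(np.mean(per_image_accuracy(p, gt)), rmse(p, gt))
```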

A linear regression was performed between computed and actual counts, as shown in Figure 8.
The R² value of 0.90 in Figure 8 suggests that the regression line fits the data well, which means the
computed count of the tomatoes is similar to the actual count. The root mean square error (RMSE)
for 2400 synthetic images is equal to 1.16, and for 100 real images it is 2.52, based on our proposed method.

Figure 8. A linear regression between computed and actual counts for 100 real tomato images
(y = 0.963x + 0.6747, R² = 0.9058).

4.4. Comparison with Other Techniques

We compared our results with several methods, namely an area-based technique, a shallow neural
network, and our network with the original Inception-ResNet.

Area-based techniques calculate the number of fruits based on the total area of fruits and of an
individual fruit. We applied mathematical morphology techniques after converting our RGB images
to YCbCr space to isolate the pixels that belong to tomatoes. After calculating the total fruit pixels in
each image, we divided it by the average pixel coverage of one tomato to get the count. The average
pixel number of each tomato was attained experimentally so that there is a minimum distance
between the actual count and the calculated count based on the area. The average accuracy over one
hundred images for this method is 66.16%. Figure 9 shows a linear regression between the count
computed by the area-based method and the actual count for one hundred real tomato images.
The RMSE of 100 real images based on this method is 13.56.

Figure 9. A linear regression between computed counts by the area-based method and the actual count
for 100 real tomato images (y = 0.2763x + 15.913, R² = 0.2661).
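The area-based baseline described above can be sketched as follows. The YCbCr threshold, the morphological structuring element, and the average per-tomato pixel coverage are not specified in the paper, so the values below are placeholders chosen only to illustrate the pipeline.

```python
import cv2
import numpy as np

# Placeholder values; the paper tunes the per-tomato pixel coverage experimentally.
AVG_PIXELS_PER_TOMATO = 120.0
LOWER_YCRCB = np.array([0, 150, 0], np.uint8)      # assumed lower bound on (Y, Cr, Cb)
UPPER_YCRCB = np.array([255, 255, 130], np.uint8)  # assumed upper bound (high Cr = reddish)

def area_based_count(bgr_image):
    """Count tomatoes as (total red-pixel area) / (average area of one tomato)."""
    ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb)
    mask = cv2.inRange(ycrcb, LOWER_YCRCB, UPPER_YCRCB)
    # Morphological opening removes isolated noise pixels before counting.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    fruit_pixels = cv2.countNonZero(mask)
    return fruit_pixels / AVG_PIXELS_PER_TOMATO
```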

It can be inferred from Figure 9 that the performance of the method is not consistent and the R²
value is also very poor.

We also trained and tested the results over a shallow network. The shallow network consists of
two convolutional layers and two fully-connected layers. In the third method, we used the original
Inception-ResNet-A module of the Inception-ResNet-v4 [7] layer instead of our modified version
in Figure 4. Table 2 shows the average accuracy over one hundred images using the proposed method
and three other methods.
Table 2. Average accuracy over 100 images.

Method                                                Average Accuracy (%)
Proposed method                                       91.03
Area-based counting                                   66.16
Shallow network                                       11.60
Our network with the original Inception-ResNet-A     76.00

It can be inferred from Table 2 that the proposed method is significantly better than the area-based
method. The reason is that the area-based method is not scale invariant. Moreover, the main problem
with area-based methods is that whenever there is occlusion by other tomatoes, foliage, or branches,
the total pixel coverage of the tomatoes will be less than the actual coverage and will lead to a false
count of the tomatoes.
Table 3 shows the average time required to count the tomatoes in one test image using
the proposed method, the area-based method, and the time required by a human.

Table 3. Average time for counting.

Method Average Time Required for One Test Image (second)


Proposed method 0.006
Area-based method 0.05
Manual counting 6.5

With the help of Table 3, it is clear that the proposed method is faster than the area-based
and manual counting methods.
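The per-image times in Table 3 can be reproduced in spirit with a simple wall-clock measurement such as the one below; the paper does not describe its timing procedure, so the warm-up and batching choices here are assumptions.

```python
import time
import numpy as np

def average_prediction_time(model, images, warmup=5):
    """Average wall-clock seconds to predict the count for one image."""
    images = np.asarray(images, np.float32)
    model.predict(images[:warmup])         # warm-up run to exclude one-time setup cost
    start = time.perf_counter()
    model.predict(images, batch_size=1)    # one image at a time, as in deployment
    return (time.perf_counter() - start) / len(images)
```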

5. Conclusion and Future Works


We proposed a simulated learning approach for counting fruits. We based our architecture on Inception-ResNet
to achieve high accuracy and to lower the computation cost. It is very difficult to obtain a sufficient
number of real images with their actual count for the training stage in deep learning; in this paper
we generated synthetic tomato images for training the network. We observed 91% accuracy for one
hundred randomly-chosen real images. Our algorithm is robust under poor conditions. It can count
accurately even if tomatoes are under shadow, occluded by foliage, branches, or if there is some degree
of overlap amongst tomatoes. Although our algorithm was trained to count tomatoes, it can be applied
to other fruits. Our algorithm is able to count ripe and half-ripe fruits; however, it fails to count green
fruits because it is not trained for this purpose. In the future, we are planning to add green fruits to the
synthetic dataset, so it would be able to count fruits in all stages.
In the future, we are planning to develop a mobile application based on the proposed algorithm
which can be used directly by farmers for yield estimation and cultivation practices. Moreover,
the proposed algorithm will be implemented on unmanned ground vehicles (UGVs) and unmanned
aerial vehicles (UAVs) for online yield estimation and precision agriculture applications. Our efforts
will directly support citizen science.

Acknowledgments: This project was supported partially by a Texas Comprehensive Research Fund Grant from
the Texas A&M University-Corpus Christi Division of Research, Commercialization and Outreach and Texas
A&M University-Corpus Christi Research Enhancement Grant.
Author Contributions: Rahnemoonfar and Sheppard conceived and designed the experiments; Sheppard
performed the experiments; Rahnemoonfar supervised the whole project. Both authors analyzed the data
and wrote the paper.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Nair, V.; Hinton, G.E. 3D object recognition with deep belief nets. In Advances in Neural Information Processing Systems, Proceedings of the Neural Information Processing Systems Conference, Vancouver, BC, Canada, 7–10 December 2009; Neural Information Processing Systems Foundation, Inc.: Ljubljana, Slovenia, 2009; pp. 1339–1347.
2. Han, J.; Zhang, D.; Hu, X.; Guo, L.; Ren, J.; Wu, F. Background prior-based salient object detection via deep reconstruction residual. IEEE Trans. Circuit Syst. Video Technol. 2015, 25, 1309–1321.
3. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 27–30 June 2016; pp. 770–778.

4. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, Proceedings of the Neural Information Processing Systems Conference, Montreal, QC, Canada, 7–12 December 2015; Neural Information Processing Systems Foundation, Inc.: Ljubljana, Slovenia, 2015; pp. 91–99.
5. Zhu, Y.; Urtasun, R.; Salakhutdinov, R.; Fidler, S. Segdeepm: Exploiting segmentation and context in deep neural networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4703–4711.
6. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv 2016, arXiv:1602.07261.
7. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, Proceedings of the Neural Information Processing Systems Conference, Lake Tahoe, NV, USA, 3–6 December 2012; Neural Information Processing Systems Foundation, Inc.: Ljubljana, Slovenia, 2012; pp. 1097–1105.
8. Socher, R.; Huval, B.; Bath, B.P.; Manning, C.D.; Ng, A.Y. Convolutional-recursive deep learning for 3D object classification. NIPS 2012, 3, 8.
9. Ciregan, D.; Meier, U.; Schmidhuber, J. Multi-column deep neural networks for image classification. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3642–3649.
10. Seguí, S.; Pujol, O.; Vitrià, J. Learning to count with deep object features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 90–96.
11. Zhang, C.; Li, H.; Wang, X.; Yang, X. Cross-scene crowd counting via deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 833–841.
12. Oñoro-Rubio, D.; López-Sastre, R.J. Towards perspective-free object counting with deep learning. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 615–629.
13. Xie, W.; Noble, J.A.; Zisserman, A. Microscopy cell counting with fully convolutional regression networks. In Proceedings of the MICCAI 1st Workshop on Deep Learning in Medical Image Analysis, Munich, Germany, 5–9 October 2015.
14. Laliberte, A.S.; Ripple, W.J. Automated wildlife counts from remotely sensed imagery. Wildl. Soc. Bull. 2003, 31, 362–371.
15. Del Río, J.; Aguzzi, J.; Costa, C.; Menesatti, P.; Sbragaglia, V.; Nogueras, M.; Sardà, F.; Mànuel, A. A new colorimetrically-calibrated automated video-imaging protocol for day-night fish counting at the OBSEA coastal cabled observatory. Sensors 2013, 13, 14740–14753. [CrossRef] [PubMed]
16. Ryan, D.; Denman, S.; Fookes, C.; Sridharan, S. Crowd counting using multiple local features. In Proceedings of the Digital Image Computing: Techniques and Applications, Melbourne, Australia, 1–3 December 2009; pp. 81–88.
17. Kim, J.-W.; Choi, K.-S.; Choi, B.-D.; Ko, S.-J. Real-time vision-based people counting system for the security door. In Proceedings of the International Technical Conference on Circuits/Systems Computers and Communications, Phuket Arcadia, Thailand, 16–19 July 2002; pp. 1416–1419.
18. Lempitsky, V.; Zisserman, A. Learning to count objects in images. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 6–11 December 2010; pp. 1324–1332.
19. Giuffrida, M.V.; Minervini, M.; Tsaftaris, S.A. Learning to count leaves in rosette plants. In Proceedings of the BMVC (British Machine Vision Conference), Swansea, UK, 7–10 September 2015.
20. Wang, Q.; Nuske, S.; Bergerman, M.; Singh, S. Automated crop yield estimation for apple orchards. In Experimental Robotics; Springer: Berlin, Germany, 2013; pp. 745–758.
21. Li, Y.; Cao, Z.; Lu, H.; Xiao, Y.; Zhu, Y.; Cremers, A.B. In-field cotton detection via region-based semantic image segmentation. Comput. Electron. Agric. 2016, 127, 475–486. [CrossRef]
22. Lu, H.; Cao, Z.; Xiao, Y.; Li, Y.; Zhu, Y. Region-based colour modelling for joint crop and maize tassel segmentation. Biosyst. Eng. 2016, 147, 139–150. [CrossRef]
23. Schillaci, G.; Pennisi, A.; Franco, F.; Longo, D. Detecting tomato crops in greenhouses using a vision based method. In Proceedings of the International Conference Ragusa SHWA2012, Ragusa Ibla, Italy, 3–6 September 2012; pp. 252–258.
24. Wang, L.; Liu, S.; Lu, W.; Gu, B.; Zhu, R.; Zhu, H. Laser detection method for cotton orientation in robotic cotton picking. Trans. Chin. Soc. Agric. Eng. 2014, 30, 42–48.
25. Teixidó, M.; Font, D.; Pallejà, T.; Tresanchez, M.; Nogués, M.; Palacín, J. Definition of linear color models in the RGB vector color space to detect red peaches in orchard images taken under natural illumination. Sensors 2012, 12, 7701–7718. [CrossRef] [PubMed]
26. Wei, J.D.; Fei, S.M.; Wang, M.L.; Yuan, J.N. Research on the segmentation strategy of the cotton images on the natural condition based upon the HSV color-space model. Cotton Sci. 2008, 20, 34–38.
27. Linker, R.; Cohen, O.; Naor, A. Determination of the number of green apples in RGB images recorded in orchards. Comput. Electron. Agric. 2012, 81, 45–57. [CrossRef]
28. Tabb, A.L.; Peterson, D.L.; Park, J. Segmentation of apple fruit from video via background modeling. In Proceedings of the 2006 ASABE Annual Meeting, Portland, Oregon, 9–12 July 2006.
29. French, G.; Fisher, M.; Mackiewicz, M.; Needle, C. Convolutional neural networks for counting fish in fisheries surveillance video. In Proceedings of the BMVC (British Machine Vision Conference), Swansea, UK, 7–10 September 2015.
30. Bargoti, S.; Underwood, J. Deep fruit detection in orchards. arXiv 2016, arXiv:1610.03677.
31. Sa, I.; Ge, Z.; Dayoub, F.; Upcroft, B.; Perez, T.; McCool, C. DeepFruits: A fruit detection system using deep neural networks. Sensors 2016, 16, 1222. [CrossRef] [PubMed]
32. Bargoti, S.; Underwood, J. Image segmentation for fruit detection and yield estimation in apple orchards. arXiv 2016, arXiv:1610.08120.
33. Šulc, M.; Mishkin, D.; Matas, J. Very deep residual networks with maxout for plant identification in the wild. In Proceedings of the CLEF 2016 Conference, Évora, Portugal, 5–8 September 2016.
34. Hou, L.; Wu, Q.; Sun, Q.; Yang, H.; Li, P. Fruit recognition based on convolution neural network. In Proceedings of the 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), Changsha, China, 13–15 August 2016; pp. 18–22.
35. Filipe, S.; Alexandre, L.A. From the human visual system to the computational models of visual attention: A survey. Artif. Intell. Rev. 2013, 39, 147.
36. Liu, T.; Fang, S.; Zhao, Y.; Wang, P.; Zhang, J. Implementation of training convolutional neural networks. arXiv 2015, arXiv:1506.01195.
37. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
38. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv 2015, arXiv:1502.03167.
39. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2818–2826.
40. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Li, F.-F. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
41. Dumoulin, V.; Visin, F. A guide to convolution arithmetic for deep learning. arXiv 2016, arXiv:1603.07285.
42. Lin, M.; Chen, Q.; Yan, S. Network in network. arXiv 2013, arXiv:1312.4400.
43. Srivastava, N.; Hinton, G.E.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
44. Kingma, D.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
45. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. AISTATS 2010, 9, 249–256.
46. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, Version 2; 2015. Available online: www.tensorflow.org (accessed on 20 April 2017).

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
