Fig. 3. Before and after snapshots of the five behaviors used by the robot.

Fig. 4. Audio signal processing and sound representation: a) The raw sound recorded after the robot performs the shake behavior on the Vitamin C bottle. b) Computed spectrogram of the sound. The horizontal axis denotes time, while the vertical dimension denotes the 33 frequency bins. Orange-yellow color indicates high intensity. c) The sequence of states in the SOM for the detected sound, obtained after each column vector of the spectrogram is mapped to a node in the SOM. The length of the sequence $S_i$ is $l_i$, which is the same as the length of the horizontal time dimension of the spectrogram shown in b). Each sequence token $s_{ij} \in A$, where $A$ is the set of SOM nodes.
IV. LEARNING METHODOLOGY

A. Feature Extraction using a Self-Organizing Map

Each sound, $S_i$, is represented as a sequence of nodes in a Self-Organizing Map (SOM) [11]. To obtain such a representation, features from each sound were first extracted using the log-normalized Discrete Fourier Transform (DFT), which was computed for each sound using $2^5 + 1 = 33$ frequency bins with a window of 26.6 milliseconds (ms), computed every 10.0 ms. The SPHINX4 speech recognition library (with default parameters) was used to compute the DFT [12]. Fig. 4 a) and b) show an example sound wave and the resulting spectrogram after applying the Fourier transform. The spectrogram encodes the intensity level of each frequency bin (vertical axis) at each given point in time (horizontal axis).
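For concreteness, the sketch below reproduces this feature-extraction step in NumPy. It is not the SPHINX4 front end used in the paper: the sampling rate, the window function, the reduction of the full DFT magnitude spectrum to 33 bins, and the exact form of the log normalization are all assumptions made for illustration.

```python
import numpy as np

def log_spectrogram(wave, fs=16000, n_bins=33, win_s=0.0266, hop_s=0.010):
    """Log-magnitude spectrogram with n_bins frequency bins per time slice."""
    win, hop = int(win_s * fs), int(hop_s * fs)
    cols = []
    for start in range(0, len(wave) - win + 1, hop):
        frame = wave[start:start + win] * np.hanning(win)
        mag = np.abs(np.fft.rfft(frame))
        # collapse the full spectrum into n_bins roughly equal groups
        # (an assumption; the paper does not describe this reduction)
        edges = np.linspace(0, len(mag), n_bins + 1, dtype=int)
        binned = [mag[a:b].mean() for a, b in zip(edges[:-1], edges[1:])]
        cols.append(np.log(np.asarray(binned) + 1e-10))
    return np.array(cols).T  # shape (33, l_i); column j is the vector c_ij
```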
Let Pi be a spectrogram, such that Pi = [ci1 , ci2 , . . . , cili ] In other words, each sequence Si consists of a sequence of
where each cij ∈ R33 (i.e., cij is the 33-dimensional column states in the SOM.
feature vector of the spectrogram at time slice j) and li is The machine learning algorithms used in this study require
the number of column vectors in the spectrogram Pi . Given a symmetric similarity function that can compute how similar
a collection of spectrograms, P = {Pi }K i=1 , a set of column two sequences Si and Sj are. Computing similarity measures
vectors is sampled from them as an input dataset and used to between a pair of sequences over a finite alphabet (e.g.,
train a two dimensional SOM of size 6 by 6, i.e., containing strings) is a well established area of computer science,
a total of 36 nodes. The SOM is trained with input data resulting in a wide variety of algorithms for exact and
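The paper relies on the GHSOM toolbox's default training procedure; the loop below is only a generic online-SOM training sketch under assumed learning-rate and neighborhood-decay schedules (the function and parameter names are ours), included to make the training step concrete.

```python
import numpy as np

def train_som(columns, rows=6, cols=6, iters=20000, seed=0):
    """Minimal online SOM training over pooled spectrogram column vectors."""
    rng = np.random.default_rng(seed)
    # sample 1/8 of the available column vectors, as in the paper
    data = columns[rng.choice(len(columns), len(columns) // 8, replace=False)]
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    w = rng.normal(size=(rows * cols, data.shape[1]))  # node weight vectors
    for t in range(iters):
        x = data[rng.integers(len(data))]
        bmu = np.argmin(((w - x) ** 2).sum(axis=1))    # best-matching unit
        frac = t / iters
        lr, sigma = 0.5 * (1 - frac), 3.0 * (1 - frac) + 0.5
        h = np.exp(-((grid - grid[bmu]) ** 2).sum(axis=1) / (2 * sigma ** 2))
        w += lr * h[:, None] * (x - w)                 # pull nodes toward x
    return w, grid  # 36 nodes: weight vectors and their 2-D grid coordinates
```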
After training the SOM, each spectrogram, $P_i$, is mapped to a sequence of states, $S_i$, in the SOM by mapping the columns of $P_i$ to nodes in the map. A mapping function is defined, $M(c_{ij}) \rightarrow s_{ij}$, where $c_{ij} \in \mathbb{R}^{33}$ is the input column vector and $s_{ij}$ is the node in the SOM with the highest activation value given the current input $c_{ij}$. Thus, each sound is represented as a sequence, $S_i = s_{i1} s_{i2} \ldots s_{il_i}$, where each $s_{ik} \in A$, $A$ is the set of SOM nodes, and $l_i$ is the number of column vectors in the spectrogram, as shown in Fig. 4. In other words, each sequence $S_i$ consists of a sequence of states in the SOM.
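A minimal version of this mapping is sketched below, reusing the node weights from the training sketch above and treating the node with the smallest Euclidean distance to the input as the most activated one.

```python
import numpy as np

def to_sequence(P, w):
    """Map spectrogram P (shape 33 x l_i) to the SOM state sequence S_i:
    each column c_ij is assigned the index of its best-matching node."""
    # squared distance of every column to every node weight vector
    d = ((P.T[:, None, :] - w[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)  # S_i = s_i1 s_i2 ... s_il_i as node indices
```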
The machine learning algorithms used in this study require a symmetric similarity function that can compute how similar two sequences $S_i$ and $S_j$ are. Computing similarity measures between a pair of sequences over a finite alphabet (e.g., strings) is a well-established area of computer science, resulting in a wide variety of algorithms for exact and approximate string matching [14]. In this study, we define a similarity function, $NW(S_i, S_j)$, between two such sequences to be the normalized global alignment score computed with the Needleman-Wunsch alignment algorithm [15], [14]. While global alignment is typically used for comparing biological protein and DNA sequences, it can also be used as a measure of similarity between two sequences over a finite alphabet in other domains, such as signal and natural language processing [14]. To compute the score between two sequences, a substitution cost must be defined over each pair of tokens in the alphabet. In this study, the substitution cost between two states $s_p$ and $s_q$ is set to the Euclidean distance between the corresponding SOM nodes (each of which is described by its $x$ and $y$ coordinates in the 2-D plane) in the map.
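The sketch below implements such a global alignment over SOM state sequences. The Euclidean substitution cost follows the text; the gap penalty and the exact normalization of the score are not specified in the paper, so the values and the mapping to a similarity in [0, 1] used here are assumptions.

```python
import numpy as np

def nw_similarity(S1, S2, grid, gap=1.0):
    """Needleman-Wunsch global alignment between two SOM state sequences.

    Substitution cost of states (p, q) = Euclidean distance between their
    SOM nodes in the 2-D grid. Total alignment cost is averaged over the
    longer sequence and mapped to a similarity (higher = more similar).
    """
    n, m = len(S1), len(S2)
    D = np.zeros((n + 1, m + 1))
    D[:, 0] = np.arange(n + 1) * gap
    D[0, :] = np.arange(m + 1) * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = np.linalg.norm(grid[S1[i - 1]] - grid[S2[j - 1]])
            D[i, j] = min(D[i - 1, j - 1] + sub,  # substitution / match
                          D[i - 1, j] + gap,      # gap in S2
                          D[i, j - 1] + gap)      # gap in S1
    max_dist = np.hypot(5, 5)          # largest node distance on a 6x6 grid
    avg_cost = D[n, m] / max(n, m)
    return max(0.0, 1.0 - avg_cost / max_dist)
```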
B. Data Collection

Let $B = \{\text{grasp, shake, drop, push, tap}\}$ be the set of exploratory behaviors available to the robot. For each of the five behaviors, the robot performs 10 trials with each of the 36 objects, resulting in a total of $5 \times 10 \times 36 = 1800$ recorded interaction trials. During the $i$-th trial, the robot records a data triple of the form $(B_i, O_i, S_i)$, where $B_i \in B$ is the executed behavior, $O_i \in O$ is the object in the interaction, and $S_i = s_{i1} s_{i2} \ldots s_{il_i}$ is the sequence of activated SOM nodes over the duration of the sound. In other words, each triple, $(B_i, O_i, S_i)$, indicates that sound $S_i$ was detected when performing behavior $B_i$ on object $O_i$.
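For illustration only, the recorded triples could be stored as follows; the container and field names are hypothetical, not the paper's.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Trial:
    """One recorded interaction: the triple (B_i, O_i, S_i)."""
    behavior: str        # B_i: one of grasp, shake, drop, push, tap
    obj: str             # O_i: one of the 36 objects
    sequence: List[int]  # S_i: SOM node indices s_i1 ... s_il_i

# 5 behaviors x 10 trials x 36 objects = 1800 Trial records in total.
```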
Given such data, the task of the robot is to learn a model such that, given a sound sequence, $S_i$, the robot can estimate the object class, $O_i$, and the type of interaction, $B_i$, responsible for generating the sound $S_i$. In other words, given a sound $S_i$, the robot should be able to estimate $Pr(O_i = o \mid S_i)$ and $Pr(B_i = b \mid S_i)$ for each object $o \in O$ and for each behavior $b \in B$. The next subsection describes the two learning algorithms used to solve this task.

C. Learning Algorithms

1) K-Nearest Neighbor: K-Nearest Neighbor (k-NN) is a learning algorithm which does not build an explicit model of the data, but simply stores all data points and their class labels and uses them only when the model is queried to make a prediction. The k-NN model falls within the family of lazy learning or memory-based learning algorithms [16], [17].

To make a prediction on a test data point, k-NN finds its $k$ closest neighbors in the training set. In other words, given a test data point $S_i$, k-NN finds the $k$ training data points most similar to $S_i$. The algorithm returns a class label prediction which is a smoothed average of the labels of the selected neighbors. The normalized global alignment score, $NW(S_i, S_j)$, is used as the similarity metric between two data points $S_i$ and $S_j$.

In all experiments described below, $k$ was set to 3. An estimate for $Pr(O_i = o \mid S_i)$ is obtained by counting the object class labels of the $k$ neighbors. For example, if two of the three neighbors have the object class label Plastic Cup, then $Pr(O_i = \text{Plastic Cup} \mid S_i) = 2/3$. Similarly, if the class label of the remaining neighbor is Plastic Box, then $Pr(O_i = \text{Plastic Box} \mid S_i) = 1/3$. The probabilities $Pr(B_i = b \mid S_i)$ are computed in the same manner.
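Under the same assumptions, and reusing nw_similarity and the SOM grid from the earlier sketches, a small k-NN predictor over sequences might look as follows; with $k = 3$ it reproduces the 2/3 vs. 1/3 style of probability estimates described above.

```python
from collections import Counter

def knn_predict(query, train, grid, k=3):
    """Estimate Pr(label | query) from the k most similar training sequences.

    `train` is a list of (sequence, label) pairs (a hypothetical layout);
    the label can be the object class O_i or the behavior B_i.
    """
    ranked = sorted(train,
                    key=lambda ex: nw_similarity(query, ex[0], grid),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return {label: count / k for label, count in votes.items()}
```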
2) Support Vector Machine: The Support Vector Machine (SVM) classifier is a supervised learning algorithm from the family of discriminative models [18]. Let $(x_i, y_i)_{i=1,\ldots,l}$ be a set of labeled inputs, where $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, +1\}$. SVM learns a linear decision function $f(x) = \langle x, w \rangle + b$, with $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$, that can accurately classify novel data points. The linear decision function $f(x)$ is learned by solving a dual quadratic optimization problem, in which $w$ and $b$ are optimized so that the margin of separation between the two classes is maximized [18].

A good linear decision function $f(x)$ in the $n$-dimensional input space, however, does not always exist. To overcome this problem, the labeled inputs can be mapped into a (possibly) higher-dimensional feature space, e.g., $x_i \rightarrow \Phi(x_i)$, where a good linear decision function can be found. Rather than computing $\Phi(x_i)$ directly, the mapping is defined implicitly with a kernel function $K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$, which replaces the dot product $\langle x_i, x_j \rangle$ in the dual quadratic optimization framework (see [18], [19] for details). The kernel function can also be considered a measure of similarity between two data points.

While SVM is typically used on data points described by a numerical feature vector, it can also be applied to sequence data points. In the experiments conducted here, the kernel function between two sequences of sound data points, $S_i$ and $S_j$, is defined as a power function of the normalized global alignment score, i.e., $K(S_i, S_j) = NW(S_i, S_j)^p$, where $p$ is set to 5. Because SVM is formulated to solve a binary classification problem, the pairwise-coupling method of Hastie and Tibshirani [20] was applied to generalize the original binary SVM algorithm to the multi-class problem of object and behavior recognition as follows: a binary SVM is trained on data from each pair of classes (in this case, objects and behaviors). During prediction, each SVM votes for one of the classes in the pair that it was trained on, and the votes are aggregated to get a final prediction. The SVM implementation in the WEKA machine learning library [21] was used, which implements the sequential minimal optimization algorithm for training the model [22].
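A rough scikit-learn stand-in for this setup is sketched below (the paper used WEKA's SMO implementation). It precomputes the kernel matrix $K(S_i, S_j) = NW(S_i, S_j)^p$; note that sklearn's SVC resolves multi-class problems by one-vs-one voting, which approximates but does not reproduce the pairwise coupling of [20], and that an alignment-based kernel is not guaranteed to be positive semi-definite.

```python
import numpy as np
from sklearn.svm import SVC

def fit_sequence_svm(train_seqs, labels, grid, p=5):
    """Train an SVM on sequences via the precomputed kernel NW(Si, Sj)^p."""
    n = len(train_seqs)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            k_ij = nw_similarity(train_seqs[i], train_seqs[j], grid) ** p
            K[i, j] = K[j, i] = k_ij
    return SVC(kernel="precomputed").fit(K, labels)

# To classify a new sequence, pass SVC.predict a row of kernel values
# between the test sequence and every training sequence.
```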
TABLE I
OBJECT RECOGNITION ACCURACY FOR k-NN AND SVM.

Behavior   k-Nearest Neighbor   Support Vector Machine
Grasp           67.89 %                75.27 %
Shake           49.47 %                50.56 %
Drop            85.79 %                80.56 %
Push            82.89 %                84.44 %
Tap             78.15 %                75.84 %
Average         72.84 %                73.33 %
Fig. 7. Isomap visualization [23] of the learned distance matrix between the objects based on their acoustic properties. Without reading too much into
the figure, we would like to point out that objects with similar physical properties (e.g., material, shape, etc.) often appear close to each other in the graph.
For example, the two pop cans appear very close to each other in the rightmost portion of the graph. Similarly, the pink and green cups are very close to
each other in the left portion of the graph. Also, several of the objects with contents inside of them are clustered in the top part of the graph. The Isomap
method places links (shown in blue dashed lines) between each point and its k closest neighbors, where k is set to 10.
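A figure like Fig. 7 can be produced along the following lines with scikit-learn; how the paper converts the pairwise NW similarities into a distance matrix is not stated, so the simple $1 - \text{similarity}$ transformation used here is an assumption.

```python
import numpy as np
from sklearn.manifold import Isomap

def embed_objects(similarity):
    """2-D Isomap embedding of the objects from their pairwise similarities."""
    distance = 1.0 - similarity          # assumed similarity-to-distance map
    np.fill_diagonal(distance, 0.0)
    iso = Isomap(n_neighbors=10, n_components=2, metric="precomputed")
    return iso.fit_transform(distance)   # one 2-D point per object
```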
REFERENCES

[1] C. A. Fowler, "Auditory perception is not special: We see the world, we feel the world, we hear the world," Journal of the Acoustical Society of America, vol. 88, pp. 2910–2915, 1991.
[2] W. W. Gaver, "What in the world do we hear? An ecological approach to auditory event perception," Ecological Psychology, vol. 5, pp. 1–29, 1993.
[3] M. Grassi, "Do we hear size or sound? Balls dropped on plates," Perception and Psychophysics, vol. 67, no. 2, pp. 274–284, 2005.
[4] D. Norman, The Design of Everyday Things. Doubleday, 1988.
[5] C. Carello, J. Wagman, and M. Turvey, "Acoustic specification of object properties," in Moving Image Theory: Ecological Considerations, J. Anderson and B. Anderson, Eds. Carbondale, IL: Southern Illinois University Press, 2005, pp. 79–104.
[6] E. Krotkov, R. Klatzky, and N. Zumel, "Robotic perception of material: Experiments with shape-invariant acoustic measures of material type," in Experimental Robotics IV, ser. Lecture Notes in Control and Information Sciences. Springer Berlin/Heidelberg, 1996, vol. 223, pp. 204–211.
[7] J. L. Richmond and D. K. Pai, "Active measurement of contact sounds," in Proc. of the IEEE International Conference on Robotics and Automation (ICRA), 2000, pp. 2146–2152.
[8] J. L. Richmond, "Automatic measurement and modelling of contact sounds," Master's thesis, University of British Columbia, 2000.
[9] E. Torres-Jara, L. Natale, and P. Fitzpatrick, "Tapping into touch," in Proc. of the Fifth International Workshop on Epigenetic Robotics, 2005, pp. 79–86.
[10] J. Sinapov, M. Weimer, and A. Stoytchev, "Interactive learning of the acoustic properties of objects by a robot," in Proceedings of the RSS Workshop on Robot Manipulation: Intelligence in Human Environments, Zurich, Switzerland, 2008.
[11] T. Kohonen, Self-Organizing Maps. Springer, 2001.
[12] K.-F. Lee, H.-W. Hon, and R. Reddy, "An overview of the SPHINX speech recognition system," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, no. 1, pp. 35–45, 1990.
[13] A. Chan and E. Pampalk, "Growing hierarchical self organizing map (GHSOM) toolbox: visualizations and enhancements," in Proceedings of the 9th International Conference on Neural Information Processing (ICONIP), 2002, pp. 2537–2541.
[14] G. Navarro, "A guided tour to approximate string matching," ACM Computing Surveys, vol. 33, no. 1, pp. 31–88, 2001.
[15] S. Needleman and C. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," J. Mol. Biol., vol. 48, no. 3, pp. 443–453, 1970.
[16] D. W. Aha, D. Kibler, and M. K. Albert, "Instance-based learning algorithms," Machine Learning, vol. 6, pp. 37–66, 1991.
[17] C. G. Atkeson, A. W. Moore, and S. Schaal, "Locally weighted learning," Artificial Intelligence Review, vol. 11, no. 1–5, pp. 11–73, 1997.
[18] V. Vapnik, Statistical Learning Theory. New York: Springer-Verlag, 1998.
[19] C. J. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, pp. 121–167, 1998.
[20] T. Hastie and R. Tibshirani, "Classification by pairwise coupling," Advances in Neural Information Processing Systems, vol. 10, pp. 507–513, 1997.
[21] I. H. Witten and E. Frank, Data Mining: Practical machine learning tools and techniques, 2nd ed. San Francisco: Morgan Kaufmann, 2005.
[22] S. Keerthi, S. Shevade, C. Bhattacharyya, and K. Murthy, "Improvements to Platt's SMO algorithm for SVM classifier design," Neural Computation, vol. 13, no. 3, pp. 637–649, 2001.
[23] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319–2323, 2000.