Segmentation
Methodology
Introduction
Emotive personality traits are at the core of understanding online audiences, their
feelings, motivations and values. Those traits are harder to identify than preferences and
demographics, therefore they are more difficult to target effectively.
When advertisers initiate campaigns, their first task is often to develop sophisticated
audience profiles. These include features such as demographic segments, affluence and
interests, but also attitudes towards well-being, family and emotional state, or in general,
emotive personality traits.
Traditional data providers derive users' preferences from their online activity and
demographic data that is available to them, but would not normally have access to emotive
personality data.
This is the problem that VisualDNA set out to solve with emotive segmentation.
Verticals
The first step of quiz design is defining its verticals. Each vertical is designed to facilitate
understanding of a distinct personality trait or life skill, and is normally composed of
three to seven questions. These approaches are demonstrated below using two of our
personality quizzes: Who Am I?, and VisualDNA PersonalityTest (which can be viewed at
whoami.visualdna.com and personality.visualdna.com, respectively).
Who Am I?
The Who Am I? quiz design draws on the five factor theory of personality (Costa &
McCrae, 1992 [2]) to measure openness to new experiences, conscientiousness,
extraversion, agreeableness and neuroticism. This model is widely accepted as the
most comprehensive in explaining individual differences in personality, and so provides
a solid foundation both for capturing data and providing an engaging user experience.
Users receive a comprehensive report detailing their percentage scores across each of
the five factors, and personalised feedback on how their personality type might affect
things such as their taste, self-control, sociability, levels of composure and outlook on life.
Visual questions were based on scales from the International Personality Item Pool (IPIP;
Goldberg, 1999 [8]). As in the IPIP, we use multiple questions and also reverse questions
(identifying inverse expression of the same characteristic) in order to more precisely
identify traits. Six questions were designed for the measurement of each personality
factor, using the granular facets they are constructed from.
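The scoring of a factor from several questions, some of them reverse questions, can be sketched as follows. This is a minimal illustration, not the production scoring: the 1-to-5 ordinal scale, the question ids and the function name `score_factor` are assumptions introduced for the example.

```python
def score_factor(responses, reverse_keyed, scale_min=1, scale_max=5):
    """Return a 0-100 percentage score for one personality factor.

    responses:     dict of question id -> ordinal answer (assumed 1..5 scale)
    reverse_keyed: set of question ids that measure the inverse expression
    """
    if not responses:
        raise ValueError("at least one response is required")
    total = 0
    for question_id, answer in responses.items():
        if not scale_min <= answer <= scale_max:
            raise ValueError(f"answer out of range for {question_id}: {answer}")
        if question_id in reverse_keyed:
            # Flip the scale so that a high value always means "more of the trait".
            answer = scale_max + scale_min - answer
        total += answer
    lowest = scale_min * len(responses)
    highest = scale_max * len(responses)
    return 100.0 * (total - lowest) / (highest - lowest)


# Six hypothetical extraversion questions; e3 and e5 are reverse questions.
extraversion_answers = {"e1": 5, "e2": 4, "e3": 1, "e4": 5, "e5": 2, "e6": 4}
extraversion_score = score_factor(extraversion_answers, reverse_keyed={"e3", "e5"})
```

Flipping reverse questions before summing means that a user who answers consistently scores high on both the direct and the inverse items, whereas a user who picks the same position everywhere ends up near the middle of the range.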
Personality verticals are based on psychology profiling science, while life skills verticals are
based on our considerable experience in user profiling. For example, the 'resourcefulness'
vertical measures users' potential for achieving tasks, and applies Csikszentmihalyi's
model for the experience of 'Flow' state (Csikszentmihalyi, 1997 [3]). As simple Likert
scales are capable of identifying the potential for flow experiences at work (Eisenberger,
Jones et al., 2005 [6]), questions were designed to represent ordinal scales as images.
Similarly, for the 'dealing with stress' vertical, we also draw on the five factor model of
personality (AKA Big Five; Costa & McCrae, 1992 [2]). However, in this case we focus on
the two personality characteristics that are associated with positive emotion and actions
(extraversion and openness to new experiences).
Question design
Visual questions are fun, simple, fast and intuitive to answer. Serious and emotionally
difficult questions can be asked using humour and euphemisms, without compromising
the clarity of the option meaning. However, some concepts are hard to express in images,
either because they are too abstract or, alternatively, too specific. In these cases we use
text-based questions. A comprehensive approach is key, both to designing questions
that cover a variety of life stages and to providing answers to those questions, while also
being as economical as possible with the number of options (which never exceeds 15).
Manual tagging is used in addition to the statistical methods described below. Weights
are assigned to different options to ensure that multiple similar answers must be given
before any characteristic is assumed.
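The weighting idea can be illustrated with a small sketch. The questions, options, weights and the threshold of 1.0 below are all hypothetical; the only point carried over from the text is that no single answer is enough to assign a characteristic.

```python
from collections import defaultdict

# Hypothetical manual tagging: (question id, option id) -> {tag: weight}.
OPTION_WEIGHTS = {
    ("commute", "bicycle"): {"sporty": 0.5},
    ("weekend", "hiking"): {"sporty": 0.4},
    ("holiday", "ski_trip"): {"sporty": 0.4},
    ("weekend", "cinema"): {"film_fan": 0.6},
}


def assign_manual_tags(answers, option_weights=OPTION_WEIGHTS, threshold=1.0):
    """Sum the weights of the chosen options and keep tags that reach the threshold.

    answers: dict of question id -> chosen option id
    """
    totals = defaultdict(float)
    for question_id, option_id in answers.items():
        for tag, weight in option_weights.get((question_id, option_id), {}).items():
            totals[tag] += weight
    return {tag for tag, total in totals.items() if total >= threshold}


# One supporting answer per tag is not enough; three 'sporty' answers are.
one_signal = assign_manual_tags({"commute": "bicycle", "weekend": "cinema"})
three_signals = assign_manual_tags(
    {"commute": "bicycle", "weekend": "hiking", "holiday": "ski_trip"}
)
```

Because every individual weight is below the threshold, a tag is only assigned when several similar answers point in the same direction.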
Descriptive statistical analysis plays a crucial role in allowing us to ensure a high level
of accuracy. The distribution of answers is regularly checked for anomalies, and
questions are fine-tuned during testing phases in response to this. When a concept is
being measured across multiple questions, the distributions of the resulting scores are
assessed. Responses to conceptually complex questions are also compared with more
straightforward text-based questions designed to measure similar traits.
Factor analysis plays a crucial role at this stage, allowing us to check that users are
answering in a way that reflects the groupings that we have aimed to identify. This
technique is described in detail in the next section. In addition to such analyses, we
collect qualitative data from participants through user testing and surveys to capture
feedback on the quiz experience.
Image tagging
Holistic statistical methods, such as factor analysis, are mainly used to extract deep,
latent personality traits, which are not self-declarative. In parallel, factual self-declarative
options are also tagged manually. For example: the question How do you travel to work
in the morning? can yield hard factual answers, as well as contribute to understanding
the quiz taker's attitude to well-being and sport; the question How do you prefer to listen
to music? can indirectly help to unfold the quiz taker's attitude to morals (in addition to
factual information).
Factor analysis
Factor analysis is a technique that identifies the hidden or latent variables that generate
observed data. In our case, this means that given a dataset of quiz answers, factor analysis
can find the unobserved personality traits that lead users to choose particular observed
answers.
Factor analysis is utilised to extract latent personality information from quiz answers. It is
a two-step process: the first step looks at the dataset as a whole and identifies its underlying
latent structure; the second maps a user's quiz answers onto this latent structure.
The following is a high-level description of our use of factor analysis. The technical reader
is encouraged to consult the appendix where the technical details of this process are
supplied.
The following is an example of the top five images that make up the 'Music' factor;
image size corresponds to factor loadings.
We derive factors for each of the quiz verticals (please refer to the quiz design
section). The number of factors identified within each vertical depends on the
variation of answers from user to user for questions in the vertical.
The following image answer combination generates a score of 2 on the music factor:
All users that scored higher than 1.68 on the music factor were assigned the 'Music
Lover' tag. Above this threshold, the higher the score, the higher the intersection
between the actual image choices and the music factor.
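The tagging rule for the music factor can be written as a simple comparison. In this sketch the 1.68 threshold and the 'Music Lover' tag come from the text; the function name and the data layout are assumptions.

```python
MUSIC_LOVER_THRESHOLD = 1.68  # threshold quoted in the text for the music factor


def tags_for_scores(factor_scores, thresholds):
    """Return the tags whose factor score is strictly higher than its cut-off.

    factor_scores: dict of factor name -> score
    thresholds:    dict of factor name -> (tag name, cut-off)
    """
    return sorted(
        tag
        for factor, (tag, cutoff) in thresholds.items()
        if factor_scores.get(factor, float("-inf")) > cutoff
    )


thresholds = {"music": ("Music Lover", MUSIC_LOVER_THRESHOLD)}
tagged = tags_for_scores({"music": 2.0}, thresholds)       # the score of 2 from the example
not_tagged = tags_for_scores({"music": 1.68}, thresholds)  # equal to the cut-off, not higher
```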
Clustering tags
Factor analysis produces a large and complex list of tags that are often related to each
other. To make our tag cloud easy to navigate and to eliminate redundancy, we cluster tags
together using a hierarchical clustering algorithm.
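A hierarchical clustering of tags can be sketched in a few lines. The text does not specify the distance measure or linkage, so this example assumes Jaccard distance between the sets of users carrying each tag, average linkage, and a hypothetical stopping distance of 0.5. All tag names other than 'Music Lover' are invented for illustration.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of user ids."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_tags(tag_users, max_distance=0.5):
    """Agglomerative, average-linkage clustering of tags.

    tag_users: dict of tag name -> set of user ids carrying that tag
    Merging stops once the closest pair of clusters is further apart
    than max_distance.
    """
    clusters = [frozenset([tag]) for tag in sorted(tag_users)]

    def linkage(c1, c2):
        pairs = [(t1, t2) for t1 in c1 for t2 in c2]
        return sum(
            jaccard_distance(tag_users[t1], tag_users[t2]) for t1, t2 in pairs
        ) / len(pairs)

    while len(clusters) > 1:
        distance, c1, c2 = min(
            ((linkage(c1, c2), c1, c2) for c1, c2 in combinations(clusters, 2)),
            key=lambda item: item[0],
        )
        if distance > max_distance:
            break
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return sorted(sorted(c) for c in clusters)


tag_users = {
    "Music Lover": {1, 2, 3, 4},
    "Gig Goer": {2, 3, 4, 5},
    "Foodie": {6, 7, 8},
    "Home Cook": {6, 7, 9},
}
tag_clusters = cluster_tags(tag_users)
```

With this toy data the two music-related tags merge first, then the two food-related tags, and the remaining two clusters share no users and stay separate.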
As of May 2013 we have 0.5 million measured users and 140 million inferred users
active on our network every month.
Since measured users have completed the quiz, they are assigned emotive tags. Using
the internet browsing patterns that we have for both types of users we infer emotive
tags for inferred users. VisualDNA uses a proprietary inference system that performs
daily updates on our pool of users. Inference allows us to successfully assign emotive
tags at scale.
We achieve this goal with a process that starts with designing insightful quizzes, and
collecting user profiles at scale. The next step is using factor analysis to extract latent
personality dimensions, and transform quiz answers to factor score representation. All users
are then assigned emotive tags, which in turn seed inference of emotive tags for users who
have not answered the quiz.
VisualDNA emotive segments are available for targeting through all major platforms,
including DoubleClick, Turn, AppNexus and Videology.
The $n \times k$ matrix $\Lambda$ is called the loadings matrix and is a parameter of the model. Equation
1 formalises the idea that the observed data is generated by unobserved hidden
phenomena. $x$ is the observed data. Roughly speaking, it is generated by the unobserved
$y$ by first sampling $y$ from a standard normal and then transforming it according to $\Lambda$. In
this sense, $\Lambda$ explains how the unobserved phenomena generate the observed data.
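For reference, the standard textbook form of the factor analysis model that matches this description is sketched below. This is an assumption about equation 1, not a reproduction of it: the symbol $\Lambda$ for the loadings matrix, the noise term $\varepsilon$ and its diagonal covariance $\Psi$ are taken from the usual formulation.

```latex
\[
x = \Lambda y + \varepsilon, \qquad
y \sim \mathcal{N}(0, I_k), \qquad
\varepsilon \sim \mathcal{N}(0, \Psi),
\]
\[
\text{so that} \quad \operatorname{Cov}(x) = \Lambda \Lambda^{\top} + \Psi ,
\]
```

where $x \in \mathbb{R}^n$ is the observed data, $y \in \mathbb{R}^k$ holds the latent factors, $\Lambda$ is the $n \times k$ loadings matrix and $\Psi$ is diagonal. Under this reading, fitting the model amounts to finding a $\Lambda$ and $\Psi$ whose implied covariance matches the observed correlation structure.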
The above form of factor analysis cannot be applied directly to VisualDNA quizzes
because our observed data is categorical (each question forms a categorical variable),
breaking the assumption of equation 1. To apply the model, we represent user answers
in a dummied table format (see [9]) and generate the tetrachoric correlation matrix (see
[5]) for this table. To generate this matrix, we assume that each of our binary variables is
actually generated by a continuous one as follows:
\[
x_i = \begin{cases} 1 & \text{if } x_i^c > t_i, \\ 0 & \text{otherwise,} \end{cases}
\]
where $x_i^c$ is a latent continuous variable and $t_i$ is some threshold. The tetrachoric
correlation matrix is the correlation matrix of the variables $x_i^c$.
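The two preparation steps, dummy coding and tetrachoric correlation, can be sketched as follows. The text does not say how the tetrachoric correlation is estimated, so the example swaps in the classical cosine-pi approximation, $r \approx \cos\left(\pi / (1 + \sqrt{ad/bc})\right)$, computed from the 2x2 table of counts; a maximum-likelihood estimate would normally be preferred. The questions and options are hypothetical.

```python
import math


def dummy_code(attempts, options):
    """Turn categorical answers into one binary column per (question, option).

    attempts: list of dicts, question id -> chosen option
    options:  dict, question id -> list of all options of that question
    """
    columns = [(q, o) for q in sorted(options) for o in options[q]]
    rows = [
        [1 if attempt.get(q) == o else 0 for q, o in columns] for attempt in attempts
    ]
    return columns, rows


def tetrachoric_approx(xs, ys):
    """Cosine-pi approximation to the tetrachoric correlation of two binary columns."""
    a = sum(1 for x, y in zip(xs, ys) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(xs, ys) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(xs, ys) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(xs, ys) if x == 0 and y == 0)
    if b * c == 0:
        # No discordant pairs: perfect association, or undefined if a*d is also 0.
        return 1.0 if a * d > 0 else float("nan")
    return math.cos(math.pi / (1.0 + math.sqrt((a * d) / (b * c))))


def column(rows, columns, key):
    """Extract one binary column by its (question, option) key."""
    index = columns.index(key)
    return [row[index] for row in rows]


attempts = [
    {"music": "headphones", "commute": "bike"},
    {"music": "headphones", "commute": "bike"},
    {"music": "headphones", "commute": "car"},
    {"music": "speakers", "commute": "car"},
    {"music": "speakers", "commute": "car"},
    {"music": "speakers", "commute": "bike"},
]
options = {"music": ["headphones", "speakers"], "commute": ["bike", "car"]}

columns, rows = dummy_code(attempts, options)
r = tetrachoric_approx(
    column(rows, columns, ("music", "headphones")),
    column(rows, columns, ("commute", "bike")),
)
```

Repeating the calculation for every pair of columns gives the correlation matrix that is passed to the factor analysis.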
Thus the input to our model is the dummied representation of user answers and its
output is the loadings matrix $\Lambda$.
There are several methods to solve for $\Lambda$. Examples include several EM algorithms [13].
From $\Lambda$ to factors
Columns of $\Lambda$ correspond to factors and its rows to quiz answers. At this point of the
analysis, factors are mathematical abstractions. To transform them into meaningful
personality traits, we examine the highest and lowest loadings in each column. We then
study the answers identified by the rows of these loadings. These answers form a set of
images that allow us to interpret factors. Each factor is assigned meaning by our team of
psychologists.
There are two technical steps that our system performs before we begin manual factor
identification. The first is a rotation of the loadings matrix. Briefly, many $\Lambda$s satisfy equation
1: if $\Lambda$ is a solution to equation 1, so is any rotation of $\Lambda$ (see [12] for more details). We want
to pick a rotation that will make the columns of $\Lambda$ maximally orthogonal. This matrix form
makes factor interpretation possible.
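The reason any rotation is also a solution can be shown in one line. The sketch assumes the standard model $x = \Lambda y + \varepsilon$ with $y \sim \mathcal{N}(0, I_k)$ as the form of equation 1, and takes $R$ to be any orthogonal $k \times k$ matrix.

```latex
\[
x = \Lambda y + \varepsilon
  = (\Lambda R)(R^{\top} y) + \varepsilon,
\qquad
R^{\top} y \sim \mathcal{N}(0, R^{\top} R) = \mathcal{N}(0, I_k),
\]
\[
(\Lambda R)(\Lambda R)^{\top} = \Lambda R R^{\top} \Lambda^{\top} = \Lambda \Lambda^{\top}.
\]
```

The rotated loadings $\Lambda R$ and the rotated factors $R^{\top} y$ therefore describe the observed data equally well, so the choice of rotation can be made purely on the grounds of interpretability.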
With the factors rotated, we measure their quality. This is done by comparing $\Lambda$ with a
loadings matrix that results from performing factor analysis on random data. We only keep
factors whose top and bottom loadings are significantly different from random loadings.
In practice, this means that we keep factors with enough loadings above certain critical
values. The main ideas behind this approach are described in detail in [1] and [15].
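The filtering rule can be sketched as follows. The text does not state how the critical values are chosen or how many loadings count as "enough", so the 0.95 quantile of absolute random loadings and the minimum of three salient loadings are hypothetical parameters, and the loadings are made-up numbers supplied directly rather than computed.

```python
def critical_value(random_loadings, quantile=0.95):
    """Magnitude that the given share of random-data loadings does not exceed.

    random_loadings: list of columns (lists of loadings) obtained from
    factor analysis on random data.
    """
    magnitudes = sorted(abs(v) for col in random_loadings for v in col)
    index = min(len(magnitudes) - 1, int(quantile * len(magnitudes)))
    return magnitudes[index]


def keep_factors(loadings_by_factor, random_loadings, min_salient=3, quantile=0.95):
    """Keep factors with at least min_salient loadings above the critical value."""
    cutoff = critical_value(random_loadings, quantile)
    return [
        name
        for name, col in loadings_by_factor.items()
        if sum(1 for v in col if abs(v) > cutoff) >= min_salient
    ]


# Hypothetical loadings from random data (two factors, five answers each).
random_loadings = [
    [0.05, -0.10, 0.08, 0.12, -0.03],
    [0.02, 0.15, -0.07, 0.01, -0.11],
]
# Hypothetical loadings from real quiz data.
observed = {
    "music": [0.80, 0.70, -0.60, 0.10, 0.05],
    "noise": [0.20, -0.10, 0.12, 0.05, 0.16],
}
kept = keep_factors(observed, random_loadings)
```

Using absolute values treats strongly positive and strongly negative loadings alike, which mirrors the text's attention to both the top and the bottom loadings of each factor.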
Factor scores
Once the factors are identified, we assign each quiz attempt a score for each factor. There
are several methods for assigning factor scores [14]. We picked an approach that allows us
to perform scoring in real time and on large volumes of data. Let $x = (x_1, x_2, \ldots, x_n)$ be a
binary representation of a quiz attempt. Then the transformation into the latent space,
the vector space spanned by the $y_i$, is given by
On the following page are some examples of the density estimations for three factor
scores:
Observe that while the theoretical model predicts a $N(0, 1)$ distribution for the $y_i$, in
practice the distributions of factor scores are far from normal and vary from factor to
factor. This makes the problem of finding a threshold that determines when to assign an emotive tag
to a user non-trivial. VisualDNA has developed an effective in-house method for automatically
determining these thresholds. We note in passing that the process of assigning each user
a score, determining the cut-offs and assigning emotive tags in essence clusters users into
groups defined by emotive tags. In fact, using factor analysis for clustering purposes is a
generalisation of the familiar k-means clustering algorithm [4].
[2] P.T. Costa and R.R. McCrae. Revised NEO Personality Inventory and NEO Five-Factor
Inventory professional manual. Psychological Assessment Resources, 1992.
[4] Chris Ding and Xiaofeng He. K-means clustering via principal component
analysis. In Proceedings of the twenty-first international conference on
Machine learning, page 29. ACM, 2004.
[7] Leandre R Fabrigar, Duane T Wegener, Robert C MacCallum, and Erin J Strahan.
Evaluating the use of exploratory factor analysis in psychological research.
Psychological methods, 4:272-299, 1999.
[9] Melissa A Hardy and Alan Bryman. Handbook of data analysis. SAGE Publications
Limited, 2004.
[11] Martin Knott and David J Bartholomew. Latent variable models and factor
analysis - A Unified Approach. Number 7. Edward Arnold, 2011.
[13] Donald B Rubin and Dorothy T Thayer. EM algorithms for ML factor analysis.
Psychometrika, 47(1):69-76, 1982.
[14] Ledyard R Tucker. Relations of factor score estimates to their use. Psychometrika,
36(4):427-436, 1971.
[15] Brett Williams, Ted Brown, and Andrys Onsman. Exploratory factor analysis: A five-
step guide for novices. Journal of Emergency Primary Health Care, 8(3):1, 2012.