
Victim or not?

A quantitative research into the predicting individual level and situational level factors of
computer crime victimization.

Master Thesis

Data Science: Business & Governance

Femke Smit

Anr / student number: 323895 / 1254055

Supervisor:
N. van Noord (Tilburg University)

Second reader:
Dr. G.A. Chrupala (Tilburg University)

May 2017
Preface

This master thesis is part of the ending phase of my master Data Science: Business and Governance.
With the completion of this thesis I will also end my academic career at Tilburg University.

While writing my master thesis for Strategic Management, I came into contact with integrated
photonics technology. Integrated photonics is needed because a huge amount of data is created every
day and the electronic way of storing and transporting these amounts is no longer sufficient.
Realizing that all that data can be analysed and be put to good use, made me decide to extend my
education at Tilburg University with one year by following the master Data Science: Business and
Governance.

By doing this research I have realized that the amount of data available can be very useful, but can
also be dangerous. Criminals can use this information to their own advantage which can lead to
financial loss and emotional damage for the victims of the crimes. This research made me realize that
it is important to get insights into which factors influence the risk of becoming a victim so that better
measures can be taken to prevent this. I hope the results of this research provide some insights and
will be helpful for researchers studying this issue in the future.

I would like to thank a few people that helped me during the process of writing my master thesis. First
of all, I want to thank my supervisor Nanne van Noord for the feedback he gave me while writing my thesis.
His comments on how to deal with the dataset and how to perform the analyses correctly helped me
improve my master thesis. In addition, I would like to thank CentERdata for providing the informative
dataset which made researching this topic possible. Finally, I am very grateful to my parents for
supporting me and believing in me, and to my boyfriend Nick for the many neck massages whenever
Python stressed me out.

Abstract

This study focuses on the area of computer crime victimization and applies predictive modelling
instead of descriptive modelling. The general theory of crime and the lifestyles/routine activities
theory are used to find the predicting factors of computer crime victimization. From the general theory
of crime, the individual level factor self-control was extracted. From the lifestyles/routine activities
theory the situational level factors suitable target, exposure to motivated offender and capable
guardianship were extracted. In this study, three prediction tasks were performed, namely predicting
victimization in general, predicting victimization of person-based crimes and predicting victimization
of computer-based crimes. Here, person-based crimes are crimes aimed at an individual, such as
harassment, and computer-based crimes are crimes aimed at computers in general, such as malware
infections. To see whether the factors retrieved from the theories could be used to predict the three
types of victimization, three machine learning algorithms were applied, namely Logistic Regression,
Naïve Bayes, and Random Forest. The results showed that Logistic Regression performs best for all
three prediction tasks. However, the model performed worse for predicting victimization of person-
based crimes than for predicting victimization in general and predicting victimization of computer-
based crimes. The coefficients of Logistic Regression and the point biserial correlation coefficients
were studied to come to conclusions concerning feature importance. For all three prediction tasks,
capable guardianship was by far the most important feature. In addition, being a suitable target was of
importance when predicting victimization in general and victimization of person-based crimes.
Finally, self-control was of importance when predicting victimization in general and victimization of
computer-based crimes. Since this study dealt with an unbalanced dataset, where there were more non-
victims than victims, additional research with a more naturally balanced dataset is needed to confirm
or contradict the results found here. In addition, other algorithms can be tested to see which one
performs best on the prediction task.

Contents

Preface 2

Abstract 3

1. Introduction 6
1.1 Context 6
1.2 Motivation 7
1.3 Problem statement and research questions 8
1.4 Structure 9
2. Theoretical framework 10
2.1 Computer crime victimization 10
2.2 Traditional victimization theories 11
2.2.1 The general theory of crime 11
2.2.2 The lifestyles/routine activities theory 12
2.3 Individual level factors and computer crime victimization 13
2.4 Situational level factors and computer crime victimization 14
2.5 Methods used to study computer crime victimization 15
3. Method 17
3.1 Raw dataset 17
3.2 Sample 17
3.3 Pre-processing 19
3.3.1 Target variable 19
3.3.2 Input features 20
3.4 Experimental set-up 24
3.4.1 Training and test set 24
3.4.2 Unbalanced dataset 25
3.4.3 Algorithms 25
3.4.4 Evaluation criteria 27
3.5 Implementation 28
4. Results 29
4.1 Predicting victimization 29
4.1.1 Parameter optimization 30
4.1.2 Performance metrics 30
4.2 Predicting victimization of person-based and computer-based crimes 31
4.2.1 Predicting victimization of person-based crimes 31
4.2.2 Predicting victimization of computer-based crimes 31

4.3 Feature importance 32
4.3.1 Victimization 32
4.3.2 Person-based crime victimization 34
4.3.3 Computer-based crime victimization 35
4.4 Summary of the results 37
5. Discussion and conclusion 38
5.1 Answers to research questions 38
5.2 Contribution 41
5.3 Limitations and further research 41
6. References 44

1. Introduction

1.1 Context

Over the last few years, the use of computers has increased. We buy our clothes online, talk to
people on the other side of the world using chat rooms, and download our movies and music through
downloading websites. However, with the increased usage of computers also comes a new kind of
crime, namely computer crimes. Criminals can hack your computer to gather personal information or
harass you online while you are using a chat room (Ngo & Paternoster, 2011; Van Wilsem, 2013). In
addition, one can get scammed while buying clothes or supplies online, meaning you will pay for the
product bought but never receive them (Bossler & Holt, 2009). Anyone who uses a computer can
become a victim of these crimes. The question is whether certain factors can help predict which kind
of persons are more vulnerable to this type of victimization than others. Not much research has been
done into this area. However, the research that has been done uses traditional victimization theories
such as the general theory of crime (Gottfredson & Hirschi, 1990) and the lifestyle routine activity
theories (Cohen & Felson, 1979; Hindelang, Gottfredson & Garofalo, 1978).

Gottfredson and Hirschi’s (1990) general theory of crime shows that low levels of self-control can
enhance victimization risk and research by Holtfreter, Reisig, Piquero and Piquero (2010) has
indicated that this also counts for online victimization. However, the full body of empirical studies that
test this relationship has produced mixed results (Pratt, Turanovic, Fox & Wright, 2014). Some
studies have shown support for the predictive characteristic of self-control, while others showed mixed
support (i.e. only for certain crimes), or no support at all. For instance, research by Franklin,
Franklin, Nobles and Kercher (2012) found that low levels of self-control result in an increase of
computer crime victimization for females, regardless of type of computer crime. In addition, research
by Bossler & Holt (2010) showed that low levels of self-control can predict person-based computer
crime victimization, but not computer-based victimization. Here, person-based offences target specific
individuals (e.g. obtaining password, online harassment) and computer-based offences target
computers in general (e.g. malware infection). This is in agreement with research by Van Wilsem
(2013), who showed that low levels of self-control lead to a higher risk of becoming a victim of
hacking and harassment.

The routine activity theory of Cohen and Felson (1979) and the lifestyle theory of Hindelang,
Gottfredson and Garofalo (1978) focus their attention on people’s behavioral routines that may lead to
an increased risk of victimization. Among others, Bossler and Holt (2009) and Newman and Clarke
(2003) showed that these theories can be applied to computer crime victimization as well. Although
Yar (2005) has contested these results by arguing that these theories are of limited use when
explaining computer crime victimization since “cybercrime is substantively different from traditional

victimization in that virtual environments are spatially and temporally disconnected and relatively
unstable, unlike physical space” (Bossler & Holt, 2010, p. 1), multiple researchers found evidence for
this application. For instance, Van Wilsem (2013) found a relationship between routine activities and
the victimization risk of hacking and harassment. In addition, research by Bossler, Holt and May
(2012) reported that people that are more present online and use protective software have an increased
risk of becoming a computer crime victim.

Previous studies mainly used descriptive modeling to describe relationships found between individual
and situational level factors and computer crime victimization. This research will go a step further by
applying predictive modeling to study whether individual level factors and situational level factors
can predict computer crime victimization. The datasets on which the predictive modeling will be
applied were retrieved from the LISS (Longitudinal Internet Studies for the Social sciences) panel
administered by CentERdata (Tilburg University, The Netherlands). The LISS panel is a
representative sample of Dutch members of households who participate in Internet surveys each
month. The datasets include background information, and information about victimization online and
offline and activities online and offline.

1.2 Motivation

The problem mentioned above is worth addressing for a few reasons. First of all, from a societal point
of view, the financial and emotional damages caused by computer crimes are substantial and online
harassment can cause economical losses and emotional stress (Bossler & Holt, 2010; Saini, Rao &
Panda, 2012). If there is more knowledge on which kind of person characteristics, or which situational
factors can predict computer crime victimization, better measures can be taken. Measures that can help
prevent becoming a victim and minimize the financial and emotional damages computer crimes can
cause.

Not much research has been done in the area of computer crime victimization. Most of the research
into computer crimes has focused on the execution of computer crimes, and the factors that influence
whether or not a person commits a computer crime. As a result, little is known about the factors that
predict computer crime victimization. In addition, research that has been done shows conflicting
results (Bossler & Holt, 2009; Bossler & Holt, 2010; Ngo & Paternoster, 2011; Pratt, Turanovic, Fox
& Wright, 2014; van Wilsem, 2013). This means that there is a need to expand the research into which
factors can predict computer crime victimization in order to create clear non-contradicting
conclusions.

Next to that, most of the research that has been done into computer crime victimization has been
conducted among students, and only in the United States. This means that the results are not
generalizable to population groups of other ages and nationalities, since previous research has

only focused on one age group in only one area. Using a more representative sample with multiple age
groups and from a different country, which this study does, will increase generalizability.

Finally, previous research into computer crime victimization has been of a descriptive nature. This
study will contribute to academic research because it will use predictive modeling, instead of
descriptive modeling, to predict the factors that influence computer crime victimization. Machine
learning algorithms will be used to study whether victimization can be predicted by the two traditional
victimization theories.

To conclude, this study is of practical relevance since it can help minimize financial and emotional
damages. Next to that, this study is of scientific relevance since it expands existing research, it studies
a broad population group from a not yet researched country, and it uses a new approach, namely
predictive modeling instead of descriptive modeling.

1.3 Problem statement and research questions

The information in the previous section shows that different factors can be of influence when
becoming a victim of computer crimes. Therefore, this study discusses the following problem
statement:

Problem statement: Previous literature showed mixed results on which individual level factors and
situational level factors are related to computer crime victimization. In addition, previous literature
only describes the possible relationships between the variables, but does not study whether the factors
can predict computer crime victimization.

In order to find an answer to the problem statement, it will first be discussed whether the traditional
victimization theories can be applied to computer crime victimization. This is because the victimization
theories are traditionally applied to offline crimes such as offline harassment or burglary. Before
applying the theories to computer crime victimization, previous literature needs to confirm that doing
so is appropriate. This results in research question 1.

Research question 1: To what extent can the general theory of crime and the lifestyles/routine
activities theory be used for studying computer crime victimization?

After discussing to what extent the theories can be applied to computer crime victimization, the
predicting individual level factors, resulting from the theory created by Gottfredson and Hirschi
(1990), will be explored. The focus will be on the factor self-control and whether different levels of
self-control will lead to computer crime victimization.

Research question 2: To what extent can individual level factors predict computer crime
victimization?

In addition to individual level factors, literature has shown that situational level factors can also play a
role in the victimization of computer crimes. So in order to formulate a conclusion to the problem
statement, the extent to which situational level factors can predict computer crime victimization also
needs to be discussed. This results in research question 3.

Research question 3: To what extent can situational level factors predict computer crime
victimization?

1.4 Structure

The remainder of this research is structured as follows. Chapter 2 contains a review of previous
studies into computer crime victimization and will discuss relevant theory needed to answer research
questions one, two and three. Chapter 3 describes the experimental setup with a clear description of
the dataset, the sample and variables used, the methods used and the evaluation criteria applied.
Chapter 4 discusses the experimental results of the analyses performed. Finally, chapter 5 will discuss
the answers to the research questions that result from the analyses done, discuss the limitations of the
study, and give recommendations for future research.

2. Theoretical framework

In this chapter, previous relevant literature is explored. A review of previous studies helps understand
the victimization area and can identify the factors that are related to computer crime victimization. In
addition, literature can provide insights into how previous analyses have been executed and in what
way this study can contribute to prior literature.

First of all, computer crime and computer crime victimization will be discussed (2.1). Secondly, the
two traditional victimization theories will be introduced and it will be discussed to what extent they
can be expanded from offline crime victimization to online computer crime victimization (2.2). Next,
sections 2.3 and 2.4 will illustrate the individual level factors and the situational level factors related to
computer crime victimization. Finally, section 2.5 will discuss which techniques were used in previous
literature to study this problem and to what extent machine learning algorithms can be used.

2.1 Computer crime victimization

Today’s society depends heavily on the use of computers and computer technologies (Choi, 2008).
Because of improved technologies more information is processed online. The processing of
information online leads to the information being more easily accessible to others (Saini, Rao &
Panda, 2012). This increased use of and dependency on computers, and its consequences, are seen as
an opportunity by criminals to perform computer crimes (Choi, 2008; Saini, Rao & Panda, 2012).
Computer crimes are: “crimes that involve computers and networks, but require more than a basic
level of computer operating skills for offenders to commit these crimes successfully against the
victims” (Choi, 2008, p. 309). Computer crimes can have multiple effects on society. One of them is
economic loss. In 2010, over 74 million people in the United States were victims of computer crimes,
which resulted in $32 billion in financial losses (Saini, Rao & Panda, 2012). In addition, computer crimes
aimed at companies can have massive losses as a result, which can have an influence on the economic
market as a whole (Hua & Bapna, 2013). Next to economic loss, computer crimes can have
psychological damage as a consequence. This can result in emotional problems (Bossler & Holt,
2010), but also in reduced (customer) trust which in turn can lead to decreased online purchases and
thus in economic loss (Saini, Rao & Panda, 2012).

It is hard to estimate the exact number of computer crimes and the monetary loss as a result, since
crimes are rarely detected by the victims and consequently not reported to authorities (Choi, 2008). In
addition, the quality of the crimes executed, where the criminals manage to stay anonymous or use
encryption devices, makes it difficult to identify and punish the offenders (Yar, 2005). For these
reasons it is useful to study the users of computers, and thus the possible victims of computer crimes.
If there is more insight into which factors can influence computer crime victimization, better measures
can be taken to prevent this.

Since online crime is a relatively new area of research, and crimes were traditionally performed offline,
no clear theories have been provided yet for online crime victimization. However, the theories available for
traditional offline crime victimization have been applied in previous literature to study whether the
same factors play a role in computer crime victimization (Bossler & Holt, 2009; Bossler & Holt, 2010;
Choi, 2008; Holtfreter, Reisig & Pratt, 2008; Ngo & Paternoster, 2011; Van Wilsem, 2013).

2.2 Traditional victimization theories

In this section the two traditional victimization theories will be introduced. First, the general theory of
crime by Gottfredson and Hirschi (1990) will be discussed. Next, the lifestyles/ routine activities theory
by Hindelang et al. (1978) and Cohen and Felson (1979) will be explained. In addition, this section
elaborates on whether the theories can be applied to computer crime victimization, next to being
applied to offline crime victimization. By doing this, research question 1, “To what extent can the
general theory of crime and the lifestyles/routine activities theory be used for studying computer crime
victimization?”, will be answered.

2.2.1 The general theory of crime

The general theory of crime was created by Gottfredson and Hirschi (1990) and tries to explain the
individual differences that cause a person to refrain from or to commit a crime, independent of the
circumstances. According to the theory, the most important individual factor that causes crime is low
self-control. The analysis of Pratt and Cullen (2000) supports this. Persons with low self-control tend
to follow their own interests, have no personal restraint, frequently engage in risk-seeking behaviour
and do not take the consequences of their behaviour into consideration (Gottfredson & Hirschi, 1990;
Holtfreter, Reisig & Pratt, 2008; Ngo & Paternoster, 2011). Although the theory was meant as a
general theory of crime and not necessarily applicable to crime victimization, different scholars
support the extension of the theory to victimization (Piquero, MacDonald, Dobrin, Daigle & Cullen,
2005; Schreck, 1999; Stewart, Elifson & Sterk, 2004). People with low levels of self-control place
themselves in risky situations and behave in an imprudent way, which increases crime opportunities,
but also victimization opportunities (Bossler & Holt, 2010). Next to that, individuals with low levels
of self-control are more likely to engage themselves with criminals, which again increases not only the
likelihood of committing a crime, but also the risk of becoming a victim (Higgins, Fell & Wilson,
2006). Also, persons with low levels of self-control do not take precautions to prevent victimization,
which again increases the possibility of becoming a victim (Schreck, 1999). Finally, Gottfredson and
Hirschi (1990) themselves concluded that there is a high level of correlation between offending and
victimization and argue that low levels of self-control indeed can lead to higher victimization risk.

Although the general theory of crime is traditionally applied to offline crime victimization, previous
research indicates that the connection between self-control and online crime victimization is also
plausible. Research has shown that individuals with low levels of self-control are more likely to

engage in risky online behaviour, such as viewing pornography (Higgins, 2005; Higgins, Wolfe &
Marcum, 2008). By doing this, they put themselves in situations where it is more likely to become a
victim. Next to that, research shows that individuals with low levels of self-control are more likely to
commit various types of computer crimes, such as digital piracy and software piracy (Buzzell, Foss &
Middleton, 2006; Higgins, 2005; Higgins, Wolfe & Marcum, 2008). By doing this, they consequently
increase the possibility of becoming a victim of crimes, since one's proximity to crime becomes larger
by committing crimes oneself (Bossler & Holt, 2010). Thus, it is plausible to use the level of self-
control as a predictor for computer crime victimization.

However, when using self-control as a predictor, the consistency in level of self-control over time
needs to be taken into account. This is because Gottfredson and Hirschi (1990) state that although the
level of self-control can change a bit over time, the relative level of self-control, or ‘ranking’, does not
change.

2.2.2 The lifestyles/routine activities theory

The lifestyles/routine activities theory consists of the lifestyle theory by Hindelang et al. (1978) and
the routine activity theory by Cohen and Felson (1979). The lifestyle theory focuses on demographic
differences that increase the risk of victimization, while the routine activities theory focuses on spatial
and temporal characteristics (Bossler & Holt, 2009; Cohen & Felson, 1979; Hindelang et al., 1978).
However, it has been argued that the routine activity theory is an extension of the lifestyle theory since
they both discuss how daily routine activities or lifestyles increase the risk of becoming a victim or
committing a crime (Ngo & Paternoster, 2011). Next to that, the lifestyle theory was introduced first,
and the routine activity theory addresses the theoretical element of the lifestyle theory that increases
the risk of becoming a victim, namely being a suitable target, but also adds the elements: motivated
offender and absence of a guardian (Bossler & Holt, 2009; Cohen & Felson, 1979; Ngo & Paternoster,
2011). In essence this means that an individual that spends time with criminals (suitable target) and
does not have a capable guardian that can protect him/her (guardianship), is more likely to come in
contact with a person with the intention and ability to commit a crime (motivated offender), and thus
becomes a victim more easily (Bossler & Holt, 2009; Cohen & Felson, 1979; Yar, 2005). When one of
these elements is not present, the probability of becoming a victim decreases (Choi, 2008). Since these
two theories are often combined and applied together in the literature, they are known as the
lifestyles/routine activities theory.

Previous results on the applicability of the lifestyle/routine activities theory to online crime
victimization have been mixed. Since the theory consists of the three elements suitable target,
motivated offender and capable guardianship, it will be discussed whether these three elements are
present online, and thus increase the risk of victimization online. First of all, research by Choi (2008)
shows that in an online setting one can speak of a suitable target. Namely, an individual is a suitable target

when he or she has unsafe routine computer usage. Furthermore, the individual is a suitable target
when he or she engages with criminals online and provides correct personal information. These
findings are supported by Marcum (2008). Concerning the exposure to motivated offenders, in an
offline setting, the amount of time you spend on the street does not influence whether you are exposed
to them, but rather which streets you walk down and at what time (Cohen & Felson, 1979; Hindelang et
al., 1978). Since this is also the case in an online setting, where the amount of time spent on the
computer does not influence victimization risk, but rather the kind of activities and the amount of time
an individual engages in that activity (Ngo & Paternoster, 2011), it can be concluded that this element
of the theory can also be applied to computer crime victimization. Finally, in an offline setting capable
guardianship refers to the presence of a guardian that can keep individuals from engaging in activities that
can lead to victimization (Cohen & Felson, 1979; Hindelang et al., 1978). It can be argued that such a
person can also prevent a person from engaging in risky activities online (Ngo & Paternoster, 2011).
In addition, in an online setting, computer software such as firewalls and virus scanners are present.
Since this kind of software helps prevent falling victim to a computer crime, it can be seen as a
guardian. This indicates that a capable guardian is also present online. In conclusion, all three elements
of the theory are present when using a computer, and thus the theory can be applied to computer crime
victimization.

2.3 Individual level factors and computer crime victimization

According to literature, the most important individual level factor, retrieved from the general theory of
crime, causing crime victimization is self-control (Gottfredson & Hirschi, 1990; Pratt & Cullen, 2000).
This section will discuss to what extent self-control is related to computer crime victimization, and
thus answer research question 2.

Previous research has been done into the relationship between self-control and the risk of becoming a
computer crime victim. In those studies, multiple different crimes are discussed. These crimes can be
placed into two groups, namely person-based crimes and computer-based crimes. Person-based crimes
are crimes where a specific individual is the target, such as online harassment and changing someone’s
password, and computer-based crimes are crimes where the computer in general is the target, such as a
malware infection or credit card theft (Bossler & Holt, 2010). The two categories were first introduced
by Bossler and Holt (2010), and will be used in this study. Numerous researchers have reported a
significant relationship between self-control and person-based crimes. First of all, Bossler and Holt
(2010) showed that low levels of self-control increase the risk of becoming a person-based computer
crime victim, but that there is no effect on the risk of becoming a computer-based crime victim. This
result is in agreement with research by Ngo and Paternoster (2011), who reported that the level of self-
control has a significant effect on the risk of getting harassed online (person-based crime) and no
effect on the risk of falling victim to any forms of computer-based crimes. In addition, research by

13
Morris (2010) and Bossler and Burruss (2011) showed that there is a significant relationship between
low levels of self-control and becoming a victim of hacking (person-based crime). Finally, van
Wilsem (2013) stated that individuals with low levels of self-control are more likely to be hacked and
harassed, which both are person-based crimes.

As mentioned above, a lot of research indicates that self-control influences the risk of becoming a
victim of person-based crimes, and that no relationship was found between self-control and
victimization of computer-based crimes. However, research by Holtfreter et al. (2008), showed that
individuals with low levels of self-control are more often a victim of consumer fraud (computer-based
crime). Research by Holtfreter et al. (2010), confirmed this finding by showing that fraud occurs more
often among people with low levels of self-control.

In conclusion, it is expected that self-control is good at predicting victims of person-based crimes, but
average at predicting victims of computer-based crimes. This is because previous research provides
sufficient evidence for the relationship between self-control and the risk of becoming a victim of person-
based crimes, but scarce evidence for the relationship with computer-based crime victimization.

2.4 Situational level factors and computer crime victimization

The three elements of the lifestyles/routine activities theory, suitable target, motivated offender and
capable guardianship, can be seen as situational level factors (Cohen & Felson, 1979; Hindelang et al.,
1978). However, whether the three elements have the same effect on victimization online as they do
offline has to be discussed. Research question 3, “To what extent can situational level factors predict
computer crime victimization?”, will be answered by doing this.

According to research, you are a suitable target when you engage in certain risky activities, for
instance clicking on pop-up messages or opening unknown email attachments (Choi, 2008; Marcum, 2008),
spending a lot of time online, and providing correct personal information (Holt & Bossler, 2009). Multiple
studies have found evidence showing that being a suitable target increases the risk of becoming a
computer crime victim. First of all, Bossler and Holt (2009) found that when a person has the habit of
spending a lot of time in online chat rooms, the risk of getting harassed online (person-based crime)
increases. In addition, Choi (2008) showed that individuals that spend a lot of time on a computer and
engage in risky behaviors, get victimized more often. Finally, Hinduja and Patchin (2008) discovered
that high computer proficiency, meaning that a person spends a lot of time with a computer, increases
the risk of becoming a victim of cyberbullying (person-based crime).

Concerning exposure to a motivated offender, Bossler and Holt (2009) showed that a person’s general
computer use, like checking email or online shopping, does not significantly influence the risk of
getting in contact with offenders, but rather the number of hours spent in chat rooms or using Instant
Messaging did. This finding is supported by Ngo and Paternoster (2011), who reported that the hours

spent on Instant Messaging increase the amount of time exposed to motivated offenders, and
ultimately increase the risk of harassment (person-based crime).

The third and final element of the theory, capable guardianship, takes the form of antivirus software or
firewalls in an online setting (Bossler & Holt, 2009; Choi, 2008; Ngo & Paternoster, 2011). Research
has shown mixed results regarding the relationship between capable guardianship and computer
crime victimization. For instance, Choi (2008) showed that having antivirus or firewall software is
related to victimization of computer crimes. In contrast, Marcum (2008) reported that no relationship
could be found between capable guardianship and computer crime victimization. Furthermore, Bossler
& Holt (2009) showed that no relationship could be found between capable guardianship and malware
infections (computer-based crime).

Not all studies mention which specific crimes the studied individuals fell victim to. However, the ones that
do mention the crimes, report that a relationship is found between the situational level factors and
person-based crimes but not between situational level factors and computer-based crimes. This is the
case for all three elements. Thus it is expected that the situational level factors are good predictors for
victimization of person-based crimes, but weaker predictors for victimization of computer-based crimes.

2.5 Methods used to study computer crime victimization

Previous studies have looked into the relationship between self-control and computer crime
victimization, and lifestyles / routine activities and computer crime victimization. All studies used
their own methods to do this. Bossler and Holt (2009) used a correlation matrix and logistic regression
models to check the relationship between lifestyles/routine activities and computer crime
victimization. They argued that logistic regression was appropriate to use since the dependent variable
was dichotomous and skewed. Holtfreter et al. (2008) agreed with this argument, and also used the
combination of correlation and logistic regression to study the relationship between self-control and
computer crime victimization, and lifestyles/routine activities and computer crime victimization. In
addition, Ngo and Paternoster (2011) as well as Van Wilsem (2013) used logistic regression to study
the relationship. All these studies included control variables as age, sex and educational level in their
analyses.

It can be concluded that previous research into the area has mainly focused on using descriptive
modelling to come to a conclusion. Based on these results, nothing can be said about the likely success
of applying machine learning methods to predict computer crime victimization, since it
simply has not been done before. However, machine learning methods have been applied when
studying factors related to committing offline crimes and offline crime victimization. For instance,
Riesen and Serpen (2009) used Bayesian belief networks to predict victimization in the National
Crime Victimization Survey. Algorithms that fall under this network are Naïve Bayes, K2 algorithm

15
and hill climbing (Riesen & Serpen, 2009). They reported that it is feasible to successfully develop a
Bayesian belief network classifier to predict victimization. In addition, Neuilly, Zgoba, Tita and Lee
(2011) have used Regression Trees to predict relapses in homicide offenders and Yearwood and
Mammadov (2010) have used Support Vector Machines to profile phishing emails.

Still, the application of machine learning methods in the areas of offline crime (victimization) and
online crime victimization remains scarce. However, since machine learning algorithms successfully
have been applied to areas similar to computer crime victimization, it is expected that they will also be
successful here. In addition, since previous studies included control variables such as sex, age and
educational level, they will be included in the algorithms as input features.

3. Method

3.1 Raw dataset

In this study, multiple datasets are used. These datasets were retrieved from the LISS (Longitudinal
Internet Studies for the Social sciences) panel administered by CentERdata (Tilburg University, The
Netherlands). The LISS panel is a representative sample of Dutch members of households who
participate in Internet surveys each month. An overview of the datasets used and relevant information
about the datasets can be found in Table 1.

Dataset | Date extracted | Sample size | Nr. of features | Age of respondents
Background variables | February 2012 | 11616 | 2 | 0 – 98 years
Conventional and computer crime victimization, wave 1 | February 2008 | 6897 | 29 | 16 years and older
Conventional and computer crime victimization, wave 2 | February 2010 | 5764 | 108 | 16 years and older
Conventional and computer crime victimization, wave 3 | February 2012 | 5709 | 108 | 16 years and older
Social Integration and leisure, wave 1 | February 2008 | 7369 | 5 | 16 years and older
Social Integration and leisure, wave 3 | February 2010 | 6415 | 5 | 16 years and older
Social Integration and leisure, wave 5 | February 2012 | 5994 | 7 | 16 years and older
Table 1. Raw datasets used and relevant information.

The dataset “Background variables” includes demographics of the members of the households that
participate in the different surveys of the LISS panel. Examples of demographics are number of
household members, sex, age, educational level and main occupation. The three waves of the dataset
“Conventional and computer crime victimization“ include questions about falling victim to offline
crimes, activities online, falling victim to online crimes, actions taken after falling victim, and
questions that measure self-control. Finally, the “Social Integration and leisure” datasets include
questions about what respondents do in their free time. For instance, what kind of sports they practice,
which social activities they engage in, whether they provide care for others or do voluntary work, how
they spend their holidays and what their main activities on a computer are.

3.2 Sample

The seven datasets were merged based on the ID number of the respondents. However, not all
respondents were used to predict victimization. Respondents that showed no consistency in their level

of self-control over the period 2008 to 2012 were removed. Paragraph 3.3.2 elaborates on this and
explains how respondents were filtered. The final dataset included 4873 respondents and 14 variables.
The target variable in this study was victimization. This was a categorical variable that shows
whether or not a respondent had fallen victim to person-based crimes or computer-based crimes
in the period 2008-2012. The information needed to create this variable was retrieved from
wave 1, wave 2 and wave 3 of the Conventional and computer crime victimization dataset.
The input variables and a short description of them can be found in Table 2. Paragraph 3.3
elaborates on how the variables were created.
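
As an illustration only, the following minimal sketch shows how the seven files could be merged on the respondent identifier using pandas. The file names and the name of the ID column are assumptions made for this sketch; the actual LISS exports use their own naming.

```python
import pandas as pd
from functools import reduce

# Hypothetical file names; the actual LISS export names differ per wave.
files = [
    "background_2012.csv",
    "crime_victimization_wave1.csv", "crime_victimization_wave2.csv", "crime_victimization_wave3.csv",
    "social_integration_wave1.csv", "social_integration_wave3.csv", "social_integration_wave5.csv",
]

# Read each dataset and merge on the respondent ID column ("respondent_id" is an
# assumed name). An outer join keeps respondents who are missing from some waves;
# the consistency filter described in paragraph 3.3.2 is applied afterwards.
frames = [pd.read_csv(f) for f in files]
merged = reduce(lambda left, right: pd.merge(left, right, on="respondent_id", how="outer"), frames)
print(merged.shape)
```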

Nr | Variable | Retrieved from which datasets | Description
0 | sex | Background variables | Dichotomous variable, Male (0) and Female (1).
1 | educational_level | Background variables | Categorical variable, different highest achieved educational levels.
2 | self_control | Conventional and computer crime victimization, wave 1/wave 2/wave 3 | Continuous variable, self-control score, averaged over the period 2008-2012.
3 | surname | Conventional and computer crime victimization, wave 2/wave 3 | Dichotomous variable, whether or not a person has one or more accounts on social media and has filled out their surname truthfully in the period 2008-2012.
4 | age | Conventional and computer crime victimization, wave 2/wave 3 | Dichotomous variable, whether or not a person has one or more accounts on social media and has filled out their age truthfully in the period 2008-2012.
5 | address | Conventional and computer crime victimization, wave 2/wave 3 | Dichotomous variable, whether or not a person has one or more accounts on social media and has filled out their address truthfully in the period 2008-2012.
6 | telephone_nr | Conventional and computer crime victimization, wave 2/wave 3 | Dichotomous variable, whether or not a person has one or more accounts on social media and has filled out their telephone number truthfully in the period 2008-2012.
7 | email | Conventional and computer crime victimization, wave 2/wave 3 | Dichotomous variable, whether or not a person has one or more accounts on social media and has filled out their email address truthfully in the period 2008-2012.
8 | pictures | Conventional and computer crime victimization, wave 2/wave 3 | Dichotomous variable, whether or not a person has one or more accounts on social media and has posted pictures of self in the period 2008-2012.
9 | hours_internet | Social Integration and leisure, wave 1/wave 3/wave 5 | Continuous variable, hours per week spent on the Internet, averaged over the period 2008-2012.
10 | hours_IM | Social Integration and leisure, wave 1/wave 3/wave 5 | Continuous variable, hours per week spent on IM, averaged over the period 2008-2012.
11 | protect_standard | Conventional and computer crime victimization, wave 1/wave 2/wave 3 | Dichotomous variable, whether or not a person has only installed the standard measures (firewall and virus scanner) before falling victim to a crime in the period 2008-2012.
12 | protect_extra | Conventional and computer crime victimization, wave 1/wave 2/wave 3 | Dichotomous variable, whether or not a person has only installed the extra measures (anti-spyware, Trojan scanner, spam filter and wireless network security) before falling victim to a crime in the period 2008-2012.
13 | protect_standard_extra | Conventional and computer crime victimization, wave 1/wave 2/wave 3 | Dichotomous variable, whether or not a person has installed both the standard and extra measures before falling victim to a crime in the period 2008-2012.
Table 2. Input variables used in the analyses and a short description.

3.3 Pre-processing

Since the LISS (Longitudinal Internet Studies for the Social sciences) panel administered by
CentERdata (Tilburg University, The Netherlands) distributes already cleaned and well organized
datasets, not a lot of data cleaning had to be done. In addition, since the variables needed for this study
were mainly categorical or binary variables, controlling for outliers was only needed for two
variables, namely the two measuring exposure to motivated offenders, which were hours spent on the
Internet per week and hours spent on IM per week. If the number of hours was larger than 120, the
answer was removed. This was done since any numbers larger than 120 are implausibly high,
considering the average person sleeps around eight hours a day, leaving only 112 hours for other
activities. The next paragraphs will elaborate on which features were included in the sample and how they
were created. In addition, descriptive statistics will be given.
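
A minimal sketch of this outlier step, assuming the merged data sit in a pandas DataFrame and using hypothetical column names; answers above 120 hours per week are set to missing:

```python
import pandas as pd

def remove_implausible_hours(df: pd.DataFrame, columns, max_hours: float = 120) -> pd.DataFrame:
    """Set weekly-hours answers above max_hours to missing, as described in the text."""
    df = df.copy()
    for col in columns:
        # Values above 120 hours per week are treated as implausible and removed.
        df.loc[df[col] > max_hours, col] = float("nan")
    return df

# Example usage with hypothetical column names:
# cleaned = remove_implausible_hours(merged, ["hours_internet", "hours_IM"])
```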

3.3.1 Target variable

Computer crime victimization

In this study, the feature to be predicted was whether or not a person has become a victim of computer
crimes. In the questionnaire, the respondents were asked whether they have fallen victim to a number
of online crimes, for example hacking or credit card theft. The possible responses to these questions
were yes (1) or no (2). For the purpose of this study, one categorical variable was created. The
categories in this variable were no victim, victim of person-based crimes or victim of computer-based
crimes. The variable took the value 0 if the respondent had not been a victim of one of the computer
crimes, 1 if the respondent was a victim of person-based crime and 2 if the respondent was a victim of

computer-based crimes. This all concerns the period 2008-2012. Figure 1 shows the number of people
in each category.

Figure 1. Barplot showing the number of people in each victimization category

As can be seen in Figure 1, the dataset is unbalanced. There are a lot more non-victims than
victims. Also, the number of victims of person-based computer crimes is very small. This needs to be
taken into account when performing the analyses.
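
A sketch of how the three-category target could be derived, assuming hypothetical per-crime indicator columns coded 1 = yes and 2 = no, as in the questionnaire. How the thesis handles respondents who fall into both crime groups is not stated, so the precedence used below is an assumption.

```python
import pandas as pd

# Hypothetical column names for illustration; the LISS variable names differ.
person_based_cols = ["victim_harassment", "victim_hacking"]
computer_based_cols = ["victim_malware", "victim_creditcard_theft"]

def victim_of_any(df: pd.DataFrame, columns) -> pd.Series:
    # In the questionnaire 1 means "yes", so a respondent is a victim of a crime
    # group if any of that group's columns equals 1.
    return (df[columns] == 1).any(axis=1)

def make_target(df: pd.DataFrame) -> pd.Series:
    target = pd.Series(0, index=df.index)               # 0 = no victim
    target[victim_of_any(df, computer_based_cols)] = 2  # 2 = victim of computer-based crimes
    target[victim_of_any(df, person_based_cols)] = 1    # 1 = victim of person-based crimes (assumed precedence)
    return target

# df["victimization"] = make_target(df)
```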

3.3.2 Input features

Self-control

In this study, self-control was measured using 12 items of dysfunctional impulsivity from the Dickman
Impulsivity Inventory. This is a suitable scale since it measures the same factors that Gottfredson and
Hirschi (1990) describe in their definition of self-control, namely whether people behave in a more
risky way and take decisions without considering the consequences (Dickman, 1990). To measure
self-control, the respondents were asked 12 questions. Examples of the questions were: “I often say
and do things without considering the consequences”, “Often, I don’t spend enough time thinking over
a situation before I act”, and “I often make up my mind without taking the time to consider the
situation from all angles”. The answer to the questions could be no (1), yes (2), or NA. The same 12
items were used in wave 1 (2008), wave 2 (2010) and wave 3 (2012).

The questions asked measure low self-control, meaning that answering yes lowers one's level of self-
control. In order to properly calculate the level of self-control, three questions needed to be rescaled,
since answering yes to these questions would lead to a higher level of self-control. The questions that
were rescaled were “I enjoy working out problems slowly and carefully”, “Before making any

important decisions, I carefully weigh up the pros and cons”, and “I am good at careful reasoning”.
Here, rescaling means setting 1 to 2 and 2 to 1.
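
A small sketch of this rescaling step, with hypothetical column names for the three reverse-coded items:

```python
import pandas as pd

# Hypothetical names for the three reverse-coded self-control items.
reverse_coded_items = ["sc_work_out_slowly", "sc_weigh_pros_cons", "sc_careful_reasoning"]

def rescale_reverse_items(df: pd.DataFrame) -> pd.DataFrame:
    """Swap the answer codes 1 and 2 for the reverse-coded items, as described above."""
    df = df.copy()
    df[reverse_coded_items] = df[reverse_coded_items].replace({1: 2, 2: 1})
    return df
```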

When selecting respondents, the following criteria were used. First of all, respondents had to show
consistency in the level of self-control, since Gottfredson and Hirschi (1990) state that although the
level of self-control can change a bit over time, the relative level of self-control, or ‘ranking’, does not
change. Secondly, respondents had to be present in at least two waves, in order to check for
consistency. In order to select the respondents, the following steps were executed. First, in the three
separate waves, the self-control score for each person was calculated by averaging the answers over
the items answered. After that, the respondents were placed in the category “high” (score ≥ 1.5)
or “low” (score < 1.5). Next, the respondents with the same category in all waves in which they were
present were selected (e.g. three times “high” or three times “low”). When the final respondents were
chosen, the answers to the items in the three waves were merged, and the final self-control score was
calculated over the waves in which the respondents were present.
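
A sketch of the selection steps, under the assumption that the per-wave average scores have already been computed into hypothetical columns sc_2008, sc_2010 and sc_2012 (missing when a respondent did not take part in that wave):

```python
import pandas as pd

wave_cols = ["sc_2008", "sc_2010", "sc_2012"]  # hypothetical per-wave average scores

def select_consistent_respondents(df: pd.DataFrame) -> pd.DataFrame:
    present = df[wave_cols].notna()
    high = df[wave_cols].ge(1.5) & present  # "high" category per wave (score >= 1.5)

    n_present = present.sum(axis=1)
    n_high = high.sum(axis=1)

    # Keep respondents present in at least two waves whose high/low category is
    # the same in every wave in which they took part.
    keep = (n_present >= 2) & ((n_high == 0) | (n_high == n_present))

    kept = df[keep].copy()
    # Final self-control score: average over the waves in which the respondent was present.
    kept["self_control"] = kept[wave_cols].mean(axis=1)
    return kept
```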

Figure 2 shows the density plot representing the distribution of level of self-control in the sample used
in this study. The total number of respondents in the sample was 4873. The bandwidth in the plot is
0.01536. As shown in the plot, most of the scores are between 1.00 and 1.20. This means that a large
number of the respondents had a low level of self-control (<1.5).

Figure 2. Density plot of the level of self-control in the sample that is used for analyses.

Suitable target

According to the literature, whether a person is a suitable target depends on their activities online. If a
person gives a lot of personal information, he or she can become a target more easily. Therefore, in
this study, it was checked whether a person is a suitable target by looking at whether respondents filled out
information online truthfully. In the questionnaire, respondents were asked on which social platforms
they had an account. Examples of social platforms were Facebook, LinkedIn, sugardudes and
Waarbenjij.nu. If respondents indicated having an account on a platform, they were asked which
information they filled out truthfully, like surname, age and email address. The possible answers to

these questions were yes (1) or no (0). Any NA’s were set to 0. In this study, multiple dichotomous
variables were created where the respondent received the value 1 if he or she had one or more accounts
online and had truthfully filled out the information. Otherwise, the respondent received the value 0. This variable
is only measured in the period 2010 till 2012, since in wave 1 (2008) it was not asked whether or not
respondents had filled out information truthfully.
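
A sketch of how one such dichotomous feature could be built, assuming hypothetical per-platform indicator columns coded 1 when the information was filled out truthfully:

```python
import pandas as pd

def truthful_on_any_platform(df: pd.DataFrame, platform_cols) -> pd.Series:
    """Return 1 if the respondent filled out this piece of information truthfully on
    at least one social media platform, otherwise 0 (NA counts as 0)."""
    return (df[platform_cols].fillna(0) == 1).any(axis=1).astype(int)

# Example with hypothetical column names for the surname question on three platforms:
# df["surname"] = truthful_on_any_platform(
#     df, ["facebook_surname", "linkedin_surname", "waarbenjij_surname"])
```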

Table 3 shows how many persons filled out which kind of information truthfully on at least one platform.
As can be seen in the table, most people filled out their surname and age truthfully. In addition, almost
27% included pictures of themselves. A relatively lower number of people provided their correct
address and telephone number.

Kind of information | Number of persons
Surname | 1474 (30.2%)
Age | 1455 (29.9%)
Address | 356 (7.3%)
Telephone number | 346 (7.1%)
Email address | 1087 (22.3%)
Pictures of self | 1311 (26.9%)
Table 3. Distribution of respondents filling in which kind of information truthfully, in numbers and percentages.

Motivated offender

Exposure to motivated offenders was measured by looking at the hours per week that were spent using
the Internet, and the hours per week that were spent chatting or using IM. Respondents were asked
how many hours per week they use the internet at home, at work, at school or someplace else. The
numbers entered were added up to come to a total number of hours spent on the Internet. Also, outliers
were removed (number of hours larger than 120) and NA’s were set to the mode for the variable per
wave. Two continuous variables were created where the hours spent on the Internet per week and the
hours spent using IM per week were averaged over the 5 years (2008-2012).
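
A sketch of how the weekly internet hours could be constructed for a single wave, using hypothetical per-location columns, with the outlier removal and mode imputation described above:

```python
import pandas as pd

# Hypothetical per-location columns for one wave.
location_cols = ["hours_home", "hours_work", "hours_school", "hours_elsewhere"]

def weekly_internet_hours(df: pd.DataFrame, max_hours: float = 120) -> pd.Series:
    """Sum the hours per location, drop implausible totals, and impute the mode."""
    total = df[location_cols].sum(axis=1, min_count=1)
    total = total.where(total <= max_hours)     # remove totals above 120 hours per week
    total = total.fillna(total.mode().iloc[0])  # replace missing values with the mode of the wave
    return total

# The per-wave totals would then be averaged over 2008, 2010 and 2012, e.g.:
# df["hours_internet"] = pd.concat([hours_2008, hours_2010, hours_2012], axis=1).mean(axis=1)
```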

Figures 3 and 4 show the density plots representing the number of hours spent on the internet per week
and the hours spent per week using IM. Most respondents spend between 0 and 20 hours per
week on the internet. In addition, they spend very little time using IM (around 2 hours).

Figure 3. Density plot showing the hours spent per week on the internet (averaged over the period
2008-2012).

Figure 4. Density plot showing the hours spent per week on IM (averaged over the period 2008-2012).

Capable guardianship

The third and last factor of the routine activities theory is the absence of a capable guardian. To check
for a capable guardian, the respondents were asked whether they took measures to protect their
computer. In addition, respondents were asked whether they had installed the measures before or after
falling victim to a computer crime. The possible answers to these questions were yes (1), no (2) and I
don’t know/prefer not to say (3). Any NA’s were set to 0. Six measures were mentioned, which in this
study were divided into “standard measures” and “extra measures”. The standard measures included a
firewall and a virus scanner, since most computers have these programs installed already. The extra
measures included anti-spyware, Trojan scanner, spam filter and wireless network security, since these
programs require some extra effort to install. Three dichotomous variables were created which
described whether the respondent had only taken standard measures, only taken extra measures or had
taken both standard and extra measures. All three variables had the values 1 and 0. The variables got
the value 1 if the measures in that category were taken, and if they were taken before falling victim to
a computer crime in the period 2008 till 2012. The variables got the value 0 if none of the measures in
the category were taken, or were taken after falling victim in these 5 years.

Table 4 shows the distribution of measures taken before falling victim to a computer crime among the
respondents in the period 2008 till 2012. Most of the respondents had no protection installed on their
computer. However, most of the respondents that had protection installed had both standard and
extra protection.

Category | Number of persons
No protection | 3210 (65.8%)
Protection | 1663 (34.2%)
Only standard protection | 199 (4.1%)
Only extra protection | 53 (1.1%)
Extra and standard protection | 1411 (29.0%)
Table 4. Distribution of protection measures taken, in numbers and percentages.

Demographics

Demographics used as input in this study were sex and educational level. Age could not be included
since the final dataset included information about a period of 5 years. The variables were retrieved
from the dataset Background Variables (2012). Table 5 illustrates how many males and females were
present in the dataset, and shows that the distribution is rather balanced. Table 6 shows how many
people belonged to which educational level.

Category | Number of persons
Male (0) | 2265 (46.5%)
Female (1) | 2608 (53.5%)
Table 5. Number of males and females in the dataset used, in numbers and percentages.

Category | Number of persons
Primary school (0) | 533 (10.9%)
VMBO (Intermediate secondary education, US: junior high school) (1) | 1267 (26.0%)
HAVO / VWO (Higher secondary education / preparatory university education, US: senior high school) (2) | 324 (6.6%)
MBO (Intermediate vocational education, US: junior college) (3) | 1114 (22.9%)
HBO (Higher vocational education, US: college) (4) | 1215 (24.9%)
WO (University) (5) | 420 (8.6%)
Table 6. Distribution of highest achieved educational level in the dataset used, in numbers and percentages.

3.4 Experimental set-up

3.4.1 Training and test set

Before a model can be made, the data have to be split into a training set and a test set. When splitting
the dataset, an 80/20 division was used. This means that 80% of the instances construct the training set

and 20% of the instances construct the test set. The training set will be used to train a model and the
test set will be used to see how well the model does on unseen data (Kuhn & Johnson, 2013).
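
A minimal sketch of the 80/20 split with scikit-learn. The DataFrame and column names are assumptions, and the stratified split (which keeps the class ratio equal in both sets) is an assumption as well; the thesis does not state whether stratification was used.

```python
from sklearn.model_selection import train_test_split

# Hypothetical DataFrame `df` holding the input variables from Table 2 plus the
# binary target column "victim"; these names are assumptions for illustration.
feature_cols = [c for c in df.columns if c != "victim"]
X, y = df[feature_cols], df["victim"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,   # 20% of the instances form the test set
    stratify=y,       # assumption: preserve the class proportions in both sets
    random_state=42,  # for reproducibility
)
```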

3.4.2 Unbalanced dataset

As was discussed in section 3.3.1 of the method, the dataset suffers from class imbalance. The classes
of interest, namely the computer crime victims, are the classes with the lowest number of instances.
Literature offers a number of ways for dealing with class imbalance. On the data level, one can use
over-sampling or under-sampling. Under-sampling tries to achieve a more balanced data set by
removing instances of the majority class (Kotsiantis, Kanellopoulos, & Pintelas, 2006). A drawback
of this method is that potentially useful data is removed, which can lead to decreased algorithm
performance. Over-sampling is a method where random replications of the minority class examples
are produced (Kotsiantis, Kanellopoulos, & Pintelas, 2006). However, this can increase the likelihood
of overfitting.

Chawla, Bowyer, Hall and Kegelmeyer (2002) describe a different method that can be used, namely a
combination between under-sampling and over-sampling. This method is called SMOTE. Chawla et
al. (2002) suggest generating synthetic minority class examples where new minority class examples
are created by interpolating between several minority class examples that lie together, instead of
generating exact copies of minority class examples. The synthetic samples are created like this: “Take
the difference between the feature vector (sample) under consideration and its nearest neighbor.
Multiply this difference by a random number between 0 and 1, and add it to the feature vector under
consideration” (Chawla et al., 2002, p. 328). After the synthetic over-sampling is performed, the
under-sampling will take place. Research has shown that combining the SMOTE method with an
under-sampling method outperforms other data level and algorithm level methods for handling
imbalance since “the initial bias towards the majority class is reversed in the favor of the minority
class” (Chawla et al., 2002, p. 331). Thus a combination of SMOTE and Tomek links will be used.
Tomek’s algorithm looks for pairs of opposing instances that are very close to each other, and removes
the majority instances of that pair. By doing this, the boundaries between the minority and majority
classes are clarified (Batista, Bazzan & Monard, 2003). The sampling will only be applied to the
training data.
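
A sketch using the SMOTETomek class from the imbalanced-learn library, which combines SMOTE over-sampling with Tomek-link removal; whether the thesis used this exact implementation is an assumption. The resampling is applied to the training data only.

```python
from imblearn.combine import SMOTETomek

# Combine SMOTE over-sampling of the minority class with removal of Tomek links.
# Applied to the training data only, never to the test data.
resampler = SMOTETomek(random_state=42)
X_train_res, y_train_res = resampler.fit_resample(X_train, y_train)

print(y_train.value_counts())      # original, imbalanced class counts
print(y_train_res.value_counts())  # resampled, more balanced class counts
```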

3.4.3 Algorithms

The prediction task performed in this study is determining whether a respondent is a victim of
computer crimes or not. In order to do this, the target variable, which was originally categorical, is
made binary where 0 means no victim and 1 means victim. The respondents will be classified using
the algorithms Logistic regression, Naïve Bayes, and Random forest. The next paragraphs will
elaborate on these algorithms. Optimal parameters for the algorithms were found using grid search and

cross-validation (10-fold). Next, the models were trained using these parameters and cross-validation
(10-fold). Finally, the models are applied to the test set, thus unseen data, and Kappa and the Area
under the ROC curve were calculated to check model performance (Kuhn & Johnson, 2013).
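
A minimal sketch, using scikit-learn, of the shared tuning and evaluation routine described here: grid search with 10-fold cross-validation on the (resampled) training data, followed by Cohen's kappa and the area under the ROC curve on the held-out test set. Whether the thesis implemented it exactly this way is an assumption.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import cohen_kappa_score, roc_auc_score

def tune_and_evaluate(estimator, param_grid, X_train, y_train, X_test, y_test):
    """Grid search with 10-fold CV on the training data, then Kappa and AUC on the test set."""
    search = GridSearchCV(estimator, param_grid, cv=10, scoring="roc_auc")
    search.fit(X_train, y_train)

    y_pred = search.predict(X_test)
    y_prob = search.predict_proba(X_test)[:, 1]
    return search.best_params_, cohen_kappa_score(y_test, y_pred), roc_auc_score(y_test, y_prob)
```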

Logistic Regression

Ordinary linear regression is not suitable here since that method allows the target variable to be
continuous and thus take on other values than the categories given. However, a form of regression
which has discrete variables as outcomes is logistic regression. Logistic regression is a method used
when the target variable is a binary feature. Binary logistic regression outputs p, the probability of belonging to a certain class. In addition, binary logistic regression models the log odds that an event occurs, given by log(p/(1-p)) (Kuhn & Johnson, 2013).

Penalties can be used in logistic regression to control overfitting. The L2 regularization penalty penalizes large weights (Daumé, 2012), trading off the fit of the model against its simplicity. Thus, the L2 regularization penalty decreases the possibility of the model overfitting on the training data, and increases generalizability to unseen data.
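
A minimal sketch of an L2-regularized logistic regression in scikit-learn, under the assumption that the resampled training data from the previous step are available as X_train_res and y_train_res; the regularization strength C is left at its default value, since it is not tuned above.

```python
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(penalty='l2', solver='newton-cg')
log_reg.fit(X_train_res, y_train_res)

# p = P(victim) for each instance in the test set
victim_probabilities = log_reg.predict_proba(X_test)[:, 1]
```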

Naïve Bayes

Naïve Bayes is a special case of Bayesian belief networks (Riesen & Serpen, 2009). The algorithm
assigns the most likely class to an instance, based on the given features. In addition, the algorithm assumes that the features are independent given the class (Rish, 2001). Although this might be an unrealistic assumption, the Naïve Bayes classifier has been shown to be successful in practice (Rish, 2001).

The classifier has no parameters to tune, so no parameter optimization has to take place. Naïve Bayes is used here since Riesen and Serpen (2009) showed that it was successful in predicting whether a person has been a victim of offline crimes. Since that area is similar to the area studied in this research, it is expected to be a good algorithm for the prediction task in this study.
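
A minimal sketch of this classifier in scikit-learn; the Gaussian variant is an assumption, as the text does not state which Naïve Bayes variant was used.

```python
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()                    # no hyperparameters to tune
nb.fit(X_train_res, y_train_res)     # balanced training data
nb_predictions = nb.predict(X_test)  # predicted class labels (0/1)
```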

Random forest

Random forest is a collection of decision trees. Decision trees learn by using if-then-else rules
(Daumé, 2012). Trees are built incrementally by deciding which questions are most important to ask.
The speed of the algorithm depends on the number of questions that need to be asked, or in other
words the depth of the tree. By limiting the depth of the tree, overfitting is prevented (Daumé, 2012).
It can be better to use a Random Forest instead of a single Decision Tree, since this helps prevent overfitting, which happens more easily with a single Decision Tree (Daumé, 2012). With a Random Forest, a prediction is made by aggregating the answers of the individual trees: either the majority vote is chosen as the predicted class, or the class with the highest average probability.

The parameters to be tuned for the Random Forest are the number of randomly selected features to choose from at each split and the number of trees. For the first parameter, it is suggested to choose values that are evenly spaced between 2 and the number of predictors (Kuhn & Johnson, 2013). For the number of trees, a trade-off needs to be made between good performance and processing time, because the number of trees the algorithm has to create is associated with computational costs: a large number of trees requires a lot of processing time and memory. According to Oshiro, Perez and Baranauskas (2012), good performance and reasonable processing time can be reached when the number of trees stays within the range of 64 to 128, so that is the range used in this study.
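
A minimal sketch of a random forest fitted with values in the ranges discussed above; the specific values shown (96 trees, 14 features per split) are illustrative grid points rather than the final tuned settings.

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=96,   # number of trees, kept within the 64-128 range
    max_features=14,   # randomly selected features considered at each split
    random_state=42
)
rf.fit(X_train_res, y_train_res)
```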

3.4.4 Evaluation criteria

Since previous research into computer crime victimization has used descriptive modeling and not
predictive modeling, it is not known which evaluation criteria are most important. However, the
research done in similar areas, such as offline crime and offline crime victimization, can provide some
insights. For instance, Riesen and Serpen (2009) used accuracy score to evaluate the Bayesian belief
networks used to predict offline crime victimization. In addition, Neuilly, Zgoba, Tita and Lee (2011)
calculated the error rate, which is equal to (1-accuracy), of the Decision Trees and Random forests
when evaluating the ability of the models to predict relapses of homicide offenders. Finally, Yearwood
and Mammadov (2010) evaluated a Support Vector Machine used to profile phishing emails using the
accuracy score. Since accuracy score has been used in areas similar to computer crime victimization, it
will also be used to evaluate the prediction task performed in this study.

Accuracy is a common performance measure for machine learning algorithms. Accuracy is calculated
by dividing the correctly predicted instances by the total instances (Daumé, 2012). However, this
metric can be misleading when used to evaluate performance on an unbalanced dataset. An alternative
for accuracy is Cohen's kappa. Cohen's kappa is a metric that compares observed accuracy with expected accuracy (which is equal to random chance). By taking random chance into account, the metric becomes less misleading than the simple accuracy score (Cohen, 1968). For example, an
observed accuracy of 0.85 is much more impressive when the expected accuracy is 0.50 than when the
expected accuracy is 0.80. Since this study deals with an unbalanced dataset, Kappa will be used as an
evaluation metric.

Next to kappa, the Area under the ROC curve (AUC) will be calculated to assess overall classification performance. The AUC is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance (Hanley & McNeil, 1982). The AUC
index for useful classifiers is constrained between 0.5 (representing chance behavior) and 1.0
(representing perfect classification performance) (Mazurowski, Habas, Zurada, Lo, Baker, & Tourassi,
2008). ROC curves are often used when evaluating classification on an unbalanced dataset since they do not place more emphasis on one class over another (Kotsiantis, Kanellopoulos & Pintelas, 2006). This evaluation metric will be useful since this study deals with an unbalanced test set.
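
A minimal sketch of this evaluation step, assuming a fitted classifier such as the log_reg model sketched earlier; kappa is computed on the predicted labels and the AUC on the predicted probabilities, both on the untouched (imbalanced) test set.

```python
from sklearn.metrics import cohen_kappa_score, roc_auc_score

y_pred = log_reg.predict(X_test)
y_prob = log_reg.predict_proba(X_test)[:, 1]

kappa = cohen_kappa_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)
print(f"Kappa: {kappa:.3f}, AUC: {auc:.3f}")
```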

3.5 Implementation

The merging of the different datasets and the pre-processing to compute the final dataset were done using RStudio (Version 0.99.903). In addition, descriptive statistics were visualized using the ggplot2 package in RStudio.

The imbalanced-learn implementation of the SMOTE technique in Python 3 was used to balance the dataset (Lemaitre, Nogueira & Aridas, 2016). In addition, the analyses were conducted using NumPy and SciPy in Python 3.

4. Results

This section provides the results of the classification tasks performed. The goal was to predict whether or not a person is a victim of computer crimes, and to find which features are important for the prediction. First of all, section 4.1 describes the prediction of victimization in general, and section 4.2 describes the prediction of victimization of person-based crimes and of victimization of computer-based crimes. After that, section 4.3 will discuss which features are most important for the prediction tasks, both according to the point biserial correlation and according to the algorithm that shows the best performance. Finally, section 4.4 will summarize the results found.

4.1 Predicting victimization

The first task to be performed was a binary classification task where 0 meant no victim and 1 meant
victim (either of person-based crimes, computer-based crimes or both). By applying the SMOTE
technique (specifically SMOTE+Tomek), the unbalanced dataset was turned into a balanced dataset.
Some exploratory analyses on the dataset showed that SMOTE improves model performance, and thus it was applied to balance the training set. In the balanced training set, there were 3093 non-victims and 3008 victims. Figure 5 shows the distribution of victims and non-victims before sampling
and after sampling. After dealing with the class imbalance, the classification task was performed to
predict whether a respondent was a victim of computer crimes or not. Three algorithms were used
when predicting the classes, namely Logistic Regression, Naïve Bayes and Random Forest.
Normalizing the features did not influence performance much.

Figure 5. Distribution of victims and non-victims in the training set before and after sampling.

4.1.1 Parameter optimization

In order to find the optimal parameters for the algorithms, gridsearch and 10-fold cross-validation
were used. Naïve Bayes was excluded from gridsearch since this algorithm has no parameters to tune.
For the Logistic Regression model it was tested whether L2 regularization would improve accuracy,
and it was tested which solver option was best. Concerning the Random Forest, the number of trees tested ranged from 64 to 128. In addition, the number of randomly selected features to choose from at each split ranged from 2 to 14, since previous research suggested choosing values that are evenly spaced between 2 and the number of predictors (Kuhn & Johnson, 2013). Table 7 gives an overview of the parameters tried and the results of the gridsearch.

Algorithm Parameters tried Optimal parameter found


Logistic Regression penalty: (l1, l2) penalty: l2
solver: (lbfgs, sag, newton-cg, liblinear) solver: newton-cg
Naïve Bayes - -
Random Forest n_estimators: (64, 96, 128) n_estimators: 96
max_features: (2, 8, 14) max_features: 14
Table 7. Optimal parameters found using gridsearch, for each algorithm, when predicting victimization in general.
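
A minimal sketch of how such a grid search with 10-fold cross-validation could be set up in scikit-learn, using the Logistic Regression grid from Table 7 as an example; the l1 penalty is paired only with a compatible solver, since scikit-learn rejects l1 in combination with lbfgs, sag and newton-cg.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Parameter grid from Table 7, split so that every combination is valid.
param_grid = [
    {'penalty': ['l2'], 'solver': ['lbfgs', 'sag', 'newton-cg', 'liblinear']},
    {'penalty': ['l1'], 'solver': ['liblinear']},
]

grid = GridSearchCV(LogisticRegression(), param_grid, cv=10)
grid.fit(X_train_res, y_train_res)
print(grid.best_params_)   # e.g. {'penalty': 'l2', 'solver': 'newton-cg'}
```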

4.1.2 Performance metrics

Cohen’s Kappa

After finding the optimal parameters for the algorithms, the kappa scores and the AUC (Area under the
ROC curve) were retrieved. These metrics are used since they do not place more emphasis on one
class over another, making them useful for evaluating performance on an unbalanced dataset. Table 8 shows the retrieved scores for each algorithm.

Algorithm Kappa AUC


Logistic Regression 0.496 0.789
Naïve Bayes 0.428 0.734
Random Forest 0.327 0.644
Table 8. Kappa score and AUC of algorithms when predicting victimization in general

As Table 8 shows, Random Forest performs worst. This model had a kappa score below 0.40, making it a fair model according to the guidelines created by Landis and Koch (1977). In addition, it had the AUC closest to 0.5. Naïve Bayes and Logistic Regression perform better. They both have a kappa score between 0.40 and 0.60, making them moderate models according to Landis and Koch (1977), and they generate better AUC scores. Logistic Regression achieves the highest scores, making it the best performing model for predicting victimization in general. Since previous literature did not use machine learning algorithms and thus did not report any results for the evaluation metrics used here, the results cannot be compared to previous work. However, the scores retrieved by Logistic Regression are fairly good, so it can be suggested that Logistic Regression is successful at predicting victimization. It was expected that Random Forest would perform better, since it is often a suitable algorithm in settings where little is known in advance.

4.2 Predicting victimization of person-based and computer-based crimes

This section will discuss to what extent self-control and the elements of the lifestyles/routine activities
theory are good predictors of victimization of person-based crimes and victimization of computer-
based crimes. To this end, two classification tasks were performed. The first task had victimization of person-based crimes as target variable, the
second task had victimization of computer-based crimes as target variable. The models used for these
tasks were Logistic Regression, Naïve Bayes and Random Forest.

4.2.1 Predicting victimization of person-based crimes

Before performing the prediction task, the optimal parameters for the models needed to be found.
Table 9 shows the parameters found.

Algorithm Parameters tried Optimal parameter found


Logistic Regression penalty: (l1, l2) penalty: l2
solver: (lbfgs, sag, newton-cg, liblinear) solver: lbfgs
Naïve Bayes - -
Random Forest n_estimators: (64, 96, 128) n_estimators: 128
max_features: (2, 8, 14) max_features: 8
Table 9. Optimal parameters found using gridsearch, for each algorithm, when predicting victimization of person-based crimes.

After finding the optimal parameters, the binary task could be performed. This task had victimization
of person-based crimes as a target. Table 10 shows the kappa score and the AUC for each algorithm
used in this task.

Algorithm Kappa AUC


Logistic Regression 0.025 0.693
Naïve Bayes 0.019 0.671
Random Forest -0.006 0.498
Table 10. Kappa scores and AUC for each algorithm used in predicting victimization of person-based
crimes.

As can be seen in Table 10, none of the algorithms perform really well. The Random Forest performs worst, with a negative kappa score and an AUC below 0.5, which indicates chance behavior. Logistic Regression performs better concerning the AUC, but still has a very low kappa score. The low scores retrieved here, compared to the scores retrieved for predicting victimization in general, are probably due to the unbalanced dataset, in which only a small number of respondents were victims of person-based crimes.

4.2.2 Predicting victimization of computer-based crimes

Before performing the prediction task, the optimal parameters for the models needed to be found.
Table 11 shows the parameters found.

Algorithm Parameters tried Optimal parameter found


Logistic Regression penalty: (l1, l2) penalty: l2
solver: (lbfgs, sag, newton-cg, liblinear) solver: newton-cg
Naïve Bayes - -
Random Forest n_estimators: (64, 96, 128) n_estimators: 64
max_features: (2, 8, 14) max_features: 14
Table 11. Optimal parameters found using gridsearch, for each algorithm, when predicting victimization of computer-based crimes.

After finding the optimal parameters, the binary task could be performed. This task had victimization of computer-based crimes as a target. Table 12 shows the kappa score and the AUC for each algorithm
used in this task.

Algorithm Kappa AUC


Logistic Regression 0.476 0.784
Naïve Bayes 0.403 0.741
Random Forest 0.275 0.627
Table 12. Kappa scores and AUC for each algorithm used in predicting victimization of computer-
based crimes.

The numbers shown in Table 12 are comparable to the numbers found when predicting victimization
in general. Again, Logistic Regression performs best with a kappa score of 0.476 and an AUC of
0.784. Naïve Bayes has scores not far below Logistic Regression. Random Forest performs worst.
Since previous literature did not use machine learning algorithms and thus did not report any kappa and AUC scores, the scores retrieved here cannot be compared to any previous scores. However, these scores are reasonably good, and thus indicate a well-performing Logistic Regression model.

4.3 Feature importance

Feature importance can be established either independently of or dependently on a classification model. In this study, the importance of features independently of a classification model is established by using the point biserial correlation coefficient (Kornbrot, 2005). The point biserial correlation coefficient measures the strength of the relationship between a continuous variable and a dichotomous variable. Furthermore, feature importance
dependently of a classification model will be established by looking at the coefficients of Logistic
Regression, since this was the best performing model in all three classification tasks. Feature
importance will be analyzed three times. First for the binary task of predicting victimization.
Secondly, for the binary task of predicting victimization of person-based crimes. Finally, for the
binary task of predicting victimization of computer-based crimes.
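
A minimal sketch of the model-independent check with SciPy, assuming the features are available as a NumPy array X with column names in feature_names and the binary target as y (all hypothetical names for illustration):

```python
from scipy import stats

# Point biserial correlation between the binary target and each feature.
for i, name in enumerate(feature_names):
    r, p_value = stats.pointbiserialr(y, X[:, i])
    print(f"{name}: r = {r:.4f} (p = {p_value:.3f})")
```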

4.3.1 Victimization

First of all, the point biserial correlation coefficients were calculated, to check importance
independently of a classification model. The results, ordered by strength, can be found in Table 13.

Nr. Feature Point biserial correlation coefficient


13 protect_standard_extra 0.4713
9 hours_internet 0.1163
4 age 0.1107
3 surname 0.1058
8 pictures 0.1018
7 email 0.0918
1 educational_level 0.0876
6 telephone_nr 0.0694
11 protect_standard 0.0663
5 address 0.0605
2 self_control 0.0555
0 sex -0.0484
12 protect_extra 0.0463
10 hours_IM 0.0276
Table 13. Point biserial correlation coefficients when predicting victimization.

As Table 13 shows, all coefficients are below 0.5 and thus no strong correlations are found between
the target variable and the features. When predicting victimization, the top 5 features with the highest
absolute coefficients are protect_standard_extra, hours_internet, age (filling out age truthfully online),
surname (filling out surname truthfully online) and pictures (providing an honest picture of yourself).
What is interesting to see is the large difference between the coefficient for protect_standard_extra and all other features; this feature shows by far the strongest relationship with victimization. However, the other values are very small, so it is useful to look at importance related to a classification model to draw more reliable conclusions. Since the Logistic Regression model performed best when predicting victimization, the coefficients of this model are good indicators of feature importance. The coefficients, ordered by strength, can be found in Table 14. Before retrieving the coefficients, the features were standardized.
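
A minimal sketch of this model-dependent step: standardize the features, refit the Logistic Regression model, and rank the coefficients by absolute size (feature_names is again an assumed list of column names).

```python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

scaler = StandardScaler().fit(X_train_res)
X_train_std = scaler.transform(X_train_res)

model = LogisticRegression(penalty='l2', solver='newton-cg')
model.fit(X_train_std, y_train_res)

# Coefficients on standardized features, ordered by absolute strength.
for name, coef in sorted(zip(feature_names, model.coef_[0]),
                         key=lambda pair: abs(pair[1]), reverse=True):
    print(f"{name}: {coef:.4f}")
```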

Nr. Feature Logistic regression coefficient


13 protect_standard_extra 1.4288
11 protect_standard 0.4066
12 protect_extra 0.2423
2 self_control 0.1108
4 age 0.0889
6 telephone_nr 0.0585
3 surname -0.0548
1 educational_level 0.0449
9 hours_internet 0.0398
7 email 0.0368
0 sex 0.0364
8 pictures -0.0364
5 address 0.0115
10 hours_IM 0.0105
Table 14. Logistic Regression coefficients when predicting victimization in general.

For the Logistic Regression model, the five most important features are protect_standard_extra,
protect_standard, protect_extra, self-control and age (filling out your age truthfully online). The
importance of having protection installed was expected because of the high point biserial correlation
coefficient. However, what is surprising is the importance of self-control for Logistic Regression. This is because, compared to the other features, it did not show similar importance in terms of the point biserial correlation coefficient.

When looking at the point biserial correlation coefficients and the Logistic Regression coefficients it
can be concluded that the most important feature for predicting victimization is having installed
standard and extra protection. This feature scored highest on both metrics and had a large difference in
coefficient compared with the other features. In addition, some of the factors representing suitable target proved to be important. Finally, there seems to be no strong relationship between victimization and self-control, but it does seem to be a good predictor of victimization.

4.3.2 Victimization of person-based crimes

First of all, the point biserial correlation coefficients were calculated, to check importance
independently of a classification model. The ordered results can be found in Table 15.

Nr. Feature Point biserial correlation coefficient


13 protect_standard_extra 0.1048
7 email 0.0413
3 surname 0.0397
11 protect_standard 0.0373
4 age 0.0371
9 hours_internet 0.0335
8 pictures 0.0265
1 educational_level 0.0230
6 telephone_nr 0.0216
10 hours_IM 0.0122
2 self_control -0.0095
12 protect_extra 0.0044
0 sex -0.0014
5 address 0.0002
Table 15. Point biserial correlation coefficients when predicting victimization of person-based crimes.

All coefficients in Table 15 are below 0.5 and thus no strong correlations are found between the target
variable and the features. When predicting victimization of person-based crimes, the top 5 features
with the highest absolute coefficients are protect_standard_extra, email (filling out email address
online truthfully), surname (filling out surname truthfully online), age (filling out age online

34
truthfully) and protect_standard. Again, having standard and extra protection has the strongest
relationship with victimization. Also, there seems to be a relationship with the features representing
suitable target. However, the values are still very small, so additional tools are needed to check
importance. Although the models predicting victimization of person-based crimes were not very
successful, it could be useful to have a look at the coefficients of the Logistic Regression model, since
these feature importances can be used as a starting point for future research. Table 16 shows the
coefficients.

Nr. Feature Logistic Regression coefficient


13 protect_standard_extra 1.0982
8 pictures -0.5446
11 protect_standard 0.4905
6 telephone_nr 0.3878
5 address -0.3241
7 email 0.2944
2 self_control -0.2680
0 sex 0.2064
12 protect_extra 0.1782
4 age 0.1537
10 hours_IM -0.0818
9 hours_internet 0.0636
3 surname 0.0626
1 educational_level -0.0600
Table 16. Logistic Regression coefficients when predicting victimization of person-based crimes.

The five most important features for the Logistic Regression model when predicting victimization of
person-based crimes are protect_standard_extra, pictures (providing pictures online truthfully),
protect_standard, telephone_nr (providing telephone number online truthfully) and address (providing
address online truthfully). The importance of these features was expected when looking at the point
biserial correlation coefficients.

In conclusion, it seems that capable guardianship and suitable target are the most important factors when predicting victimization of person-based crimes. It is surprising that self-control did not show
more importance since many researchers found evidence for the relationship between self-control and
victimization of person-based crimes (Bossler & Holt, 2010; Ngo & Paternoster, 2011; Van Wilsem,
2013). However, since the results found were not very strong, additional research is needed to confirm
or contradict the results found here.

4.3.3 Victimization of computer-based crimes

To study feature importance independently of a classification model for victimization of computer-based crimes, the point biserial correlation coefficients were calculated. The results, ordered by strength, can be found in Table 17.

Nr. Feature Point biserial correlation coefficient
13 protect_standard_extra 0.4534
9 hours_internet 0.1095
4 age 0.1026
3 surname 0.0969
8 pictures 0.0966
1 educational_level 0.0831
7 email 0.0820
6 telephone_nr 0.0648
5 address 0.0620
2 self_control 0.0596
11 protect_standard 0.0570
0 sex -0.0492
12 protect_extra 0.0462
10 hours_IM 0.0247
Table 17. Point biserial correlation coefficients when predicting victimization of computer-based
crimes.

As shown in Table 17, the top 5 features with the highest absolute coefficients are protect_standard_extra, hours_internet, age (filling out age online truthfully), surname (filling out surname online truthfully), and pictures (providing pictures of yourself). This was expected when looking at the correlation coefficients for victimization in general. The resemblance between the two is probably due to the resemblance between the datasets. However, the coefficients are again relatively small, so feature importance dependent on a classification model is needed to better establish feature importance.

Since Logistic Regression was the best performing model when predicting victimization of computer-based crimes, feature importance for this classification model can be generated. Table 18 shows the coefficients of the predictive model.

Nr. Feature Logistic regression coefficient


13 protect_standard_extra 1.3624
11 protect_standard 0.3128
12 protect_extra 0.2201
2 self_control 0.1810
6 telephone_nr 0.0717
7 email -0.0667
4 age 0.0443
1 educational_level 0.0356
8 pictures 0.0299
3 surname -0.0251
5 address 0.0215
0 sex 0.0178
9 hours_internet 0.0039
10 hours_IM 0.0005
Table 18. Logistic Regression coefficients for predicting victimization of computer-based crimes.

According to Table 18, the most important features for predicting victimization of computer-based
crimes are protect_standard_extra, protect_standard, protect_extra, self-control and telephone_nr
(providing telephone number online truthfully). The importance of having protection was expected
when looking at the point biserial correlation coefficients. However, the importance of self-control is
surprising, considering the low ranking within the correlation coefficients.

In conclusion, capable guardianship is the most important factor when predicting victimization of
computer-based crimes. Specifically, having installed standard and extra protection seems to be a good
predictor of victimization of computer-based crimes. Self-control did not have a strong relationship
with victimization, but turned out to be one of the most important feature for the Logistic Regression
model. This result supports research by Holtfreter et al. (2008), who showed that self-control is related
to victimization of computer-based crimes.

4.4 Summary of the results

Based on the previous sections it can be concluded that Logistic Regression performs best for
predicting computer crime victims. Logistic Regression achieves the highest kappa score and
generates the highest Area under the ROC curve (AUC). The features that were most important for
making the distinction between victims and non-victims were having standard and extra measures
installed (capable guardianship), and filling out some information online truthfully (suitable target).
Furthermore, no important relationship was found with self-control, but it did turn out to be one of the best predictors.

The analyses showed that the three models predicting victimization of person-based crimes were less successful. This means that feature importance dependent on a classification model could be generated, but not much weight should be put on the results. The lack of success for the models is
probably due to the low representation of victims of person-based crimes in the dataset. The analyses
showed that capable guardianship and suitable target were most important for the prediction task.
These factors thus could be used as a starting point for further research.

The feature that is most important for predicting victims of computer-based crimes is having installed standard and extra protection (capable guardianship). The importance of this feature was supported by both the point biserial correlation coefficient and the Logistic Regression model. Furthermore, no strong correlation was found with self-control, but it turned out to be an important feature for Logistic Regression.

5. Discussion and conclusion

This section provides a discussion of the research conducted and the results found. First of all, the
research questions formulated in section 1.4 will be answered. Also, the contribution of this study
within the existing framework will be illustrated. Finally, limitations of the study and
recommendations for future research will be discussed.

5.1 Answers to research questions

This study deals with the following problem statement:

Problem statement: Previous literature showed mixed results on which individual level factors and
situational level factors are related to computer crime victimization. In addition, previous literature
only describes the possible relationships between the variables, but does not study whether the factors can predict computer crime victimization.

In order to find an answer to the problem statement, three research questions were formulated.

Research question 1: To what extent can the general theory of crime and the lifestyles/routine
activities theory be used for studying computer crime victimization?

Research question 2: To what extent can individual level factors predict computer crime
victimization?

Research question 3: To what extent can situational level factors predict computer crime
victimization?

The remainder of this section will discuss the answers to the research questions, which answer the
problem statement.

Research question 1: To what extent can the general theory of crime and the lifestyles/routine
activities theory be used for studying computer crime victimization?

Two traditional victimization theories often applied are the general theory of crime and the
lifestyles/routine activities theory. These theories discuss which factors increase the risk of committing
crimes and of falling victim to crimes, traditionally in an offline setting. According to studies
performed by Buzzell, Foss and Middleton (2006), Higgins (2005) and Marcum (2008), the general
theory of crime can also be applied to computer crime victimization. Furthermore, amongst others,
Ngo and Paternoster (2011) and Choi (2008) show that the lifestyles/routine activities theory is also
applicable to computer crime victimization. The experiments performed in this study confirm that the
two theories indeed can be applied to an online setting, next to an offline setting. The results showed
that the Logistic Regression algorithm achieves a reasonable kappa score and AUC using the factors
of the theories as input features. However, the scores are not really high. This could be due to the
algorithms chosen or the features created. Trying different features to represent the factors could be a
good start. Also, applying different algorithms could give additional insights into the applicability of
the two traditional theories to prediction of victimization. Finally, it is important to keep in mind that
an algorithm depends heavily on the underlying dataset. It could be the case that the algorithms used
here perform better on a more balanced dataset where more equal training and test sets can be created.

Next to Logistic Regression, Random Forest and Naïve Bayes were used to predict computer crime
victimization using the factors of the two traditional victimization theories. A small difference was
found between the scores of Naïve Bayes and Logistic Regression, but larger differences were found
with the scores of Random Forest, which performed worse than Logistic Regression. This could be due to the nature of the data: Random Forest is particularly suited to capturing non-linear relationships, so if the data are (close to) linearly separable, Logistic Regression may have the advantage.

In conclusion, the answer to the first research question is: the two traditional victimization theories can
successfully be applied to computer crime victimization. However, additional research is needed to
support the results found here, or put them in context.

Research question 2: To what extent can individual level factors predict computer crime
victimization?

The individual level factor used in this study to predict computer crime victimization was self-control,
which was extracted from the general theory of crime (Gottfredson & Hirschi, 1990). Only persons
who showed consistency in their level of self-control were chosen in this study. This is because Gottfredson and Hirschi (1990) stated that although the level of self-control can change a bit over time, the relative level of self-control, or ‘ranking’, does not change. The results showed that there was no strong relationship between self-control and victimization. However, since the point biserial correlation coefficients were all very small, not much weight should be placed on these. Additional
analyses showed that self-control is one of the most important predictors for victimization, since it had
one of the strongest coefficients for the Logistic Regression model. However, what was surprising was
that both self-control coefficients were positive, meaning that there is a positive relationship between
self-control and victimization. This would mean that higher levels of self-control lead to a prediction
of 1, which was equal to victim in this study. This is surprising, since many studies showed the
opposite relationship between the two, where low levels of self-control lead to victimization. The reason behind this outcome needs to be studied further.

In this study, the distinction between person-based crimes and computer-based crimes is made. Here,
person-based crimes target specific individuals (e.g. obtaining password, online harassment) and computer-based crimes target computers in general (e.g. malware infection). Regarding person-based crimes, only an average-performing model was found. This could be due to the underlying dataset, which was heavily
unbalanced, and had large differences in training and test set as a result. By looking at the point
biserial correlation coefficients and the Logistic Regression coefficients, self-control had a weak
relationship with victimization of person-based crimes. This result contradicts results found by
Bossler and Holt (2010), who showed that low-levels of self-control are strongly related to
victimization of person-based crimes. In addition, it contradicts findings by Ngo and Paternoster
(2011) and van Wilsem (2013), who found that low levels of self-control often lead to victimization of
person-based crimes. However, because of the lack of a successful model, not much weight should be
put on the results found here and additional research is needed to confirm or to invalidate the results
found here.

Regarding computer-based crimes, self-control had one of the lowest point biserial correlation
coefficients. Furthermore, this coefficient was positive, meaning higher levels of self-control, instead
of lower levels, would lead to victimization of computer-based crimes. However, since the point biserial correlation coefficients were all really small, not much weight should be placed on these.
Self-control did seem to be of importance for the Logistic Regression model. This result is in
agreement with research by Holtfreter et al. (2008), who showed level of self-control influences the
risk of becoming a victim of computer-based crimes.

The absence of importance of self-control for the correlation coefficients, but the presence of
importance according to the algorithms, could be due to correlation between features. In this study, the
point biserial correlation coefficients were only calculated between the input features and the target
variable and not among the input features. Additional research needs to be done to see whether the
importance of self-control for prediction is caused by correlation with another feature.

In conclusion, the answer to research question two is: the individual level factor self-control is
important for predicting victimization in general and for predicting victimization of computer-based
crimes. However, no importance for predicting victimization of person-based crimes was found.

Research question 3: To what extent can situational level factors predict computer crime
victimization?

The situational level factors used in this study were suitable target, exposure to motivated offender and
capable guardianship, which were all extracted from the lifestyles/routine activities theory (Cohen &
Felson, 1979; Hindelang et al., 1978). The results showed that capable guardianship, and specifically
having installed standard and extra protection, was by far the most important factor for predicting victimization. In addition, some of the features representing suitable target proved to be important when predicting computer crime victimization. However, exposure to motivated offender turned out to have very little importance, which was not expected.

In this study, the distinction between person-based crimes and computer-based crimes is made. Here,
person-based crimes target specific individuals (e.g. obtaining password, online harassment) and
computer-based crimes target computers in general (e.g. malware infection). Regarding person-based
crimes, the models did not perform very well. Furthermore, the relationships found using the point
biserial correlation coefficient were not very strong, so no confident conclusions can be drawn about the predictive value of the three situational level factors. However, when testing different algorithms in the future, it could be useful to use the two situational level factors capable guardianship and suitable target as input. This could be useful for two reasons. First of all, the point biserial correlation coefficients and the Logistic Regression coefficients were highest for these factors. Secondly, the analyses have shown that capable guardianship and suitable target are important for predicting victimization in general. However, the fact that the results show a stronger relationship between capable guardianship and victimization of person-based crimes than between motivated offender and victimization of person-based crimes is not in line with previous research: Marcum (2008) reported that no relationship could be found between capable guardianship and computer crime victimization, and Bossler and Holt (2010) and Ngo and Paternoster (2011) showed that exposure to
motivated offender increases the risk of victimization of person-based crimes.

When predicting victimization of computer-based crimes, capable guardianship seems to be the most
important feature. The importance of this feature is supported by the point biserial correlation coefficient and the Logistic Regression model. This finding is in agreement with research by Choi (2008), who showed that having antivirus or firewall software is related to victimization of computer crimes. However, it contradicts research by Marcum (2008) and Bossler and Holt (2009), who reported
that no relationship could be found between capable guardianship and victimization of computer-based
crimes. It is surprising that suitable target is less important for predicting victimization of computer-
based crimes. This is the case for two reasons. First of all, suitable target did seem to be important
when predicting victimization in general. Secondly, many researchers found evidence for the
relationship between suitable target and victimization of computer-based crimes (Bossler & Holt,
2009; Choi, 2008; Hinduja & Patchin, 2008). However, these studies measured suitable target by
looking at time spent on a computer and computer proficiency, which was not done in this study. So
the lack of importance of suitable target for this classification task could be due to the features chosen
to represent the factor.

Exposure to motivated offender seemed to be of little importance for all three prediction tasks. This
was surprising since Bossler and Holt (2009) and Ngo and Paternoster (2011) reported that hours spent on the Internet and on IM increase the amount of time exposed to motivated offenders, and ultimately increase the risk of computer crime victimization.

In conclusion, the answer to research question three is as follows: the situational level factor capable
guardianship is by far the most important feature in all three prediction tasks. Furthermore, suitable
target is of importance when predicting victimization in general and victimization of person-based
crimes. Finally, exposure to motivated offender turned out to be of little importance for all three
prediction tasks.

5.2 Contribution

This study provides some first insights into the predicting factors of computer crime victimization.
This is also the most important contribution of this study. Existing literature uses descriptive modeling
to describe the data, but does not explore the use of predictive modeling. Thus by introducing a new
technique to study computer crime victimization, a contribution is made to the existing research area.

In addition, most of the research done in this area had students from the U.S. as respondents. By using
respondents from a different country and with a wider age range, the generalizability of the
results increases.

5.3 Limitations and further research

Although efforts have been made to conduct correct experiments and produce reliable results, the
study still has some limitations. First of all, the classification was conducted on an unbalanced dataset.
The dataset contained a large number of non-victims but a small number of victims. This could be due
to the fact that crimes are rarely detected by the victims and consequently not reported to authorities
(Choi, 2008), as was stated in section 2.1. So it could very well be the case that respondents with the
class label “no victim” in this study, were actually victims without knowing it. This can cause noise in
the analyses performed and results generated, and thus future research needs to take the unawareness
of possible victims into account when collecting data. To deal with the unequal division, a
combination between under-sampling and over-sampling was used on the training set. However, to
ensure realistic data, the sampling was not applied to the test set, creating a large difference between the training
and test set. Future research needs to conduct similar analyses on a naturally balanced dataset, where
an equal distribution between training and test set can be created, to check the results found here.

Another limitation could be the features that were chosen to represent the individual level factor self-
control, and the situational level factors suitable target, motivated offender and capable guardianship.
The features created were based on previous literature and on the questions provided in the
questionnaire. However, the question is whether these features correctly represent the factors of the
traditional crime theories. As was already discussed when answering the research questions, the
models perform moderately, and the coefficients and feature importances are not really high. So there
is a possibility that choosing somewhat different features to represent the factors can increase model
performance. Future research needs to look into this, and elaborate on which features can be used best
to represent the individual and situational level factors.

Finally, three algorithms were tried to correctly classify the computer crime victims. All three performed moderately or poorly. It is possible that other models, which were not taken into consideration, perform better on the different classification tasks. If this is the case, it could also lead to
different feature importances, since feature importance depends heavily on the algorithm used. Thus,
future research needs to conduct the classification tasks with different models to see if these perform
better, and consequently, have different feature importance as a result.

References

Batista, G. E., Bazzan, A. L., & Monard, M. C. (2003, December). Balancing Training Data for
Automated Annotation of Keywords: a Case Study. In WOB (pp. 10-18).
Bossler, A. M., & Burruss, G. W. (2011). The general theory of crime and computer hacking: Low
self-control hackers. Corporate hacking and technology-driven crime: Social dynamics and
implications, 38-67.
Bossler, A. M., & Holt, T. J. (2009). On-line activities, guardianship, and malware infection: An
examination of routine activities theory. International Journal of Cyber Criminology, 3(1),
400.
Bossler, A. M., & Holt, T. J. (2010). The effect of self-control on victimization in the
cyberworld. Journal of Criminal Justice, 38(3), 227-236.
Bossler, A. M., Holt, T. J., & May, D. C. (2012). Predicting online harassment victimization among a
juvenile population. Youth & Society, 44(4), 500-523.
Buzzell, T., Foss, D., & Middleton, Z. (2006). Explaining use of online pornography: A test of self-
control theory and opportunities for deviance. Journal of Criminal Justice and Popular Culture,
13(2), 96-116.
Choi, K. S. (2008). Computer crime victimization and integrated theory: An empirical
assessment. International Journal of Cyber Criminology, 2(1), 308.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority
over-sampling technique. Journal of artificial intelligence research, 16, 321-357.
Clarke, R. V., & Newman, G. R. (2003). Superhighway robbery: preventing e-commerce crime (pp. 1-
240). Willan Publishing.
Cohen, J. (1968). Weighted kappa: Nominal scale agreement provision for scaled disagreement or
partial credit. Psychological bulletin, 70(4), 213.
Cohen, L. E., & Felson, M. (1979). Social change and crime rate trends: A routine activity
approach. American sociological review, 588-608.
Daumé III, H. (2012). A Course in Machine Learning.
Dickman, S. J. (1990). Functional and dysfunctional impulsivity: personality and cognitive correlates.
Journal of personality and social psychology, 58(1), 95.
Franklin, C. A., Franklin, T. W., Nobles, M. R., & Kercher, G. A. (2012). Assessing the effect of
routine activity theory and self-control on property, personal, and sexual assault victimization.
Criminal Justice and Behavior, 39(10), 1296-1315.

Gibson, C. L. (2011). An investigation of neighborhood disadvantage, low self-control, and violent
victimization among youth. Youth violence and juvenile justice, 1541204011423767.
Gottfredson, M. R., & Hirschi, T. (1990). A general theory of crime. Stanford University
Press.
Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating
characteristic (ROC) curve. Radiology, 143(1), 29-36.
Higgins, G. E. (2004). Can low self-control help with the understanding of the software piracy
problem?. Deviant Behavior, 26(1), 1-24.
Higgins, G. E., Fell, B. D., & Wilson, A. L. (2006). Digital piracy: Assessing the contributions of an
integrated self‐control theory and social learning theory using structural equation modeling.
Criminal Justice Studies, 19(1), 3-22.
Higgins, G. E., Wolfe, S. E., & Marcum, C. D. (2008). Digital piracy: An examination of three
measurements of self-control. Deviant Behavior, 29(5), 440-460.
Hindelang, M. J., Gottfredson, M. R., & Garofalo, J. (1978). Victims of personal crime: An
empirical foundation for a theory of personal victimization. Cambridge, MA:
Ballinger.
Hinduja, S., & Patchin, J. W. (2008). Cyberbullying: An exploratory analysis of factors related to
offending and victimization. Deviant behavior, 29(2), 129-156.
Holtfreter, K., Reisig, M. D., & Pratt, T. C. (2008). Low self‐control, routine activities, and fraud
victimization. Criminology, 46(1), 189-220.
Holtfreter, K., Reisig, M. D., Piquero, N. L., & Piquero, A. R. (2010). Low self-control and fraud
offending, victimization, and their overlap. Criminal Justice and Behavior, 37(2), 188-203.
Hua, J., & Bapna, S. (2013). The economic impact of cyber terrorism. The Journal of Strategic Information Systems, 22(2), 175-186.
Kotsiantis, S., Kanellopoulos, D., & Pintelas, P. (2006). Handling imbalanced datasets: A review.
GESTS International Transactions on Computer Science and Engineering, 30(1), 25-36.
Kotsiantis, S. B., & Pintelas, P. E. (2003). Mixture of expert agents for handling imbalanced data sets.
Annals of Mathematics, Computing & Teleinformatics, 1(1), 46-55.
Kornbrot, D. (2005). Point biserial correlation. Wiley StatsRef: Statistics Reference Online.
Kuhn, M., & Johnson, K. (2013). Applied predictive modeling (Vol. 26). New York: Springer.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data.
biometrics, 159-174.
Lemaitre, G., Nogueira, F., & Aridas, C. K. (2016). Imbalanced-learn: A python toolbox to tackle the
curse of imbalanced datasets in machine learning. arXiv preprint arXiv:1609.06570.
Marcum, C. D. (2008). Identifying potential factors of adolescent online victimization for high school
seniors. International Journal of Cyber Criminology, 2(2), 346.

Mazurowski, M. A., Habas, P. A., Zurada, J. M., Lo, J. Y., Baker, J. A., & Tourassi, G. D. (2008).
Training neural network classifiers for medical decision making: The effects of imbalanced
datasets on classification performance. Neural networks, 21(2), 427-436.
Ngo, F. T., & Paternoster, R. (2011). Cybercrime victimization: An examination of individual
and situational level factors. International Journal of Cyber Criminology, 5(1), 773.
Oshiro, T. M., Perez, P. S., & Baranauskas, J. A. (2012, July). How many trees in a random forest?. In
International Workshop on Machine Learning and Data Mining in Pattern Recognition (pp.
154-168). Springer Berlin Heidelberg.
Piquero, A. R., MacDonald, J., Dobrin, A., Daigle, L. E., & Cullen, F. T. (2005). Self-control, violent
offending, and homicide victimization: Assessing the general theory of crime. Journal of
Quantitative Criminology, 21(1), 55-71.
Pratt, T. C., & Cullen, F. T. (2000). The empirical status of Gottfredson and Hirschi's general theory of
crime: A meta‐analysis. Criminology, 38(3), 931-964.
Pratt, T. C., Turanovic, J. J., Fox, K. A., & Wright, K. A. (2014). Self‐control and victimization: A
meta‐analysis. Criminology, 52(1), 87-116.
Riesen, M., & Serpen, G. (2009). A Bayesian Belief Network Classifier for Predicting Victimization
in National Crime Victimization Survey. In IC-AI (pp. 648-652).
Rish, I. (2001, August). An empirical study of the naive Bayes classifier. In IJCAI 2001 workshop on
empirical methods in artificial intelligence (Vol. 3, No. 22, pp. 41-46). IBM New York.
Saini, H., Rao, Y. S., & Panda, T. C. (2012). Cyber-crimes and their impacts: A review. International
Journal of Engineering Research and Applications, 2(2), 202-209.
Schreck, C. J. (1999). Criminal victimization and low self-control: An extension and test of a general
theory of crime. Justice Quarterly, 16(3), 633-654.
Schreck, C. J., Wright, R. A., & Miller, J. M. (2002). A study of individual and situational antecedents
of violent victimization. Justice Quarterly, 19(1), 159-180.
Stewart, E. A., Elifson, K. W., & Sterk, C. E. (2004). Integrating the general theory of crime into an
explanation of violent victimization among female offenders. Justice Quarterly, 21(1), 159-
181.
Van Wilsem, J. (2011). ‘Bought it, but never got it’ assessing risk factors for online consumer
fraud victimization. European Sociological Review, jcr053.
van Wilsem, J. (2013). Hacking and harassment—Do they have something in common? Comparing
risk factors for online victimization. Journal of Contemporary Criminal Justice, 29(4), 437-
453.
Yar, M. (2005). The novelty of ‘cybercrime’ an assessment in light of routine activity theory.
European Journal of Criminology, 2(4), 407-427.
