
Business Statistics

Assignment I
Date: 20/09/2019

Group members        Student number
Michiel Merchiers    1901213932
Diogo Gouveia        1901213939
Samuel Yazo          1901213940
Philipp Heckel       1901213931

Question 1 (70 marks) An activity which involves analyzing the first and last names of
groups of students.
PART A: Collecting Data
First, ask each group member the following 2 questions.

1. How many letters do you have in your first name?

Name Number of letters


Michiel 7
Diogo 5
Samuel 6
Philipp 7
Total 25

2. Do you have the letter E in your last name? (Yes/No)

Name Yes/No
Merchiers Yes
Gouveia Yes
Yazo No
Heckel Yes

Then, compile the letters that appear in the first names of your group.

3. What are these letters?


The letters are nominal data, as they do not carry any quantitative value. The
following letters appear in our first names:

Compilation of letters appearing in the first name


a c d e g h i l m o p s u
4. How many times does each letter appear in the first names?

Letter Absolute Frequency


a 1
c 1
d 1
e 2
g 1
h 2
i 5
l 3
m 2
o 2
p 3
s 1
u 1
Total 25
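
The tally above can be reproduced with a short script. This is a minimal Python sketch, assuming the first names are spelled as in the table for question 1; the variable names are illustrative.

```python
from collections import Counter

# First names as listed in question 1.
first_names = ["Michiel", "Diogo", "Samuel", "Philipp"]

# Count every letter across all names, ignoring case.
letter_counts = Counter("".join(first_names).lower())

for letter in sorted(letter_counts):
    print(letter, letter_counts[letter])
print("Total", sum(letter_counts.values()))
```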

PART B
a) (15 marks) Compile your group's data in excel tables. Summarize statistics for the
information about the letters in the first names and the presence of letter E in the last name
of your group.

First name Last name Number of letters in first name E in last name
Michiel Merchiers 7 Yes
Diogo Gouveia 5 Yes
Samuel Yazo 6 No
Philipp Heckel 7 Yes

Letter Absolute Frequency Relative Frequency¹ Cumulative Absolute Frequency
a 1 0.04 1
c 1 0.04 2
d 1 0.04 3
e 2 0.08 5
g 1 0.04 6
h 2 0.08 8
i 5 0.2 13
l 3 0.12 16
m 2 0.08 18
o 2 0.08 20
p 3 0.12 23
s 1 0.04 24
u 1 0.04 25

¹ Please see the Appendix for all the formulae used.
% of last names with letter E: 75%

% of last names without letter E: 25%
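
The relative and cumulative frequencies in the table above, and the share of last names containing E, follow directly from the absolute frequencies. The sketch below is one possible way to compute them in Python; the variable names are illustrative and the input values are copied from the tables in this section.

```python
from itertools import accumulate

# Absolute frequencies copied from the letter table above.
absolute = {"a": 1, "c": 1, "d": 1, "e": 2, "g": 1, "h": 2, "i": 5,
            "l": 3, "m": 2, "o": 2, "p": 3, "s": 1, "u": 1}
total = sum(absolute.values())

# Relative frequency = absolute frequency / total number of letters.
relative = {letter: count / total for letter, count in absolute.items()}

# Cumulative absolute frequency = running sum in alphabetical order.
cumulative = dict(zip(absolute, accumulate(absolute.values())))

# Share of last names that contain the letter E.
last_names = ["Merchiers", "Gouveia", "Yazo", "Heckel"]
share_with_e = sum("e" in name.lower() for name in last_names) / len(last_names)

for letter in absolute:
    print(letter, absolute[letter], round(relative[letter], 2), cumulative[letter])
print(f"Last names with E: {share_with_e:.0%}")
```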

Average letters per name    Mode        Standard deviation
6.25 ≈ 6                    Letter i    0.96 ≈ 1
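
As a check on these summary statistics, the sketch below recomputes them with Python's standard `statistics` module. It assumes the sample standard deviation with n - 1 in the denominator, as given in the Appendix, and takes the mode over letters rather than over name lengths.

```python
import statistics

first_names = ["Michiel", "Diogo", "Samuel", "Philipp"]
name_lengths = [len(name) for name in first_names]  # [7, 5, 6, 7]

mean_length = statistics.mean(name_lengths)    # sum of values / n
sample_sd = statistics.stdev(name_lengths)     # divides by n - 1

# Most frequent letter across all first names, ignoring case.
modal_letter = statistics.mode("".join(first_names).lower())

print(mean_length, round(sample_sd, 2), modal_letter)
```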

b) (10 marks) Which are categorical data? Which are numerical data? Describe the data level
of each of the data in 1-4.

Exercise    Data level      Type of data
1           Ratio data      Numerical data
2           Nominal data    Categorical data
3           Nominal data    Categorical data
4           Ratio data      Numerical data

c) (15 marks) Present the information with appropriate graphs.

[Figure: bar chart "Letter absolute frequency in first names" (x-axis: letters a to u; y-axis: absolute frequency)]
[Figure: pie chart "Presence of letter 'e' in last name (%)": last names containing letter 'e' 75%, last names without letter 'e' 25%]
[Figure: pie chart "Type of data from exercises 1-4": categorical data 50%, numerical data 50%]
(d) (15 marks) Now combine with another group, which group do you choose? Collect the
statistics for both groups and present the information with graphs as in (a) and (c).

Our additional data was collected from the group composed of Yunus Sen, Griff
Bryant and Remi Hingst.

Name Number of letters in first name


Michiel 7
Diogo 5
Samuel 6
Philipp 7
Yunus 5
Griff 5
Remi 4
Total 39

Letter (xi) Absolute Frequency Relative Frequency Cumulative Frequency
a 1 0.025641026 1
c 1 0.025641026 2
d 1 0.025641026 3
e 3 0.076923077 6
f 2 0.051282051 8
g 2 0.051282051 10
h 2 0.051282051 12
i 7 0.179487179 19
l 3 0.076923077 22
m 3 0.076923077 25
n 1 0.025641026 26
o 2 0.051282051 28
p 3 0.076923077 31
r 2 0.051282051 33
s 2 0.051282051 35
u 3 0.076923077 38
y 1 0.025641026 39
Average letters per name    Mode        Standard deviation
5.57                        Letter i    1.13
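
The combined table can be obtained by adding the two groups' letter counts. The following is a minimal sketch, assuming the seven first names listed above; `letter_counts` is a hypothetical helper introduced only for this illustration.

```python
from collections import Counter
import statistics

own_group = ["Michiel", "Diogo", "Samuel", "Philipp"]
other_group = ["Yunus", "Griff", "Remi"]

def letter_counts(names):
    """Count letters across a list of names, ignoring case."""
    return Counter("".join(names).lower())

# Adding two Counters sums the frequencies letter by letter.
combined = letter_counts(own_group) + letter_counts(other_group)

lengths = [len(name) for name in own_group + other_group]
mean_length = statistics.mean(lengths)
sample_sd = statistics.stdev(lengths)          # divides by n - 1
modal_letter, modal_count = combined.most_common(1)[0]

print(sum(combined.values()), round(mean_length, 2), round(sample_sd, 2), modal_letter)
```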

[Figure: pie chart "Presence of letter 'e' in last name (%)": last names containing letter 'e' 57%, last names without letter 'e' 43%]
[Figure: pie chart "Type of data from exercises 1-4": categorical data 50%, numerical data 50%]
[Figure: bar chart "Letter absolute frequency in first names" (x-axis: letters a to y; y-axis: absolute frequency)]

Now, think of the combined group (yours and the other group of your choice) of 8-10
students as a sample of students from the population of all 120 Business Statistics students,
THINK CREATIVELY and answer the following questions.
(e) (10 marks) Write a statement that summarizes the information about the number of
letters, and the frequencies of the letters in the first names of Business Statistics students.
Based on our sample data (n=7), we infer that in the Business Statistics class (which is
composed of 154 students) the most commonly occurring letter in first names is ‘i’.
Moreover, the sample mean (x̄) of the number of letters in first names equals 5.6
letters, so we infer that the population mean (µ) should be close to 5.6.
Given that the sample standard deviation (s) is roughly equal to one, we expect
the majority of students to have a name length between roughly 4 and 7 letters.
Finally, it would be appropriate to test a null hypothesis that the population mean
equals 5 against an alternative hypothesis that the population mean differs from 5.
However, in order to test that hypothesis, the sample should be larger than thirty (so that
the sampling distribution of the mean is approximately normal) or the population should be
assumed to be normally distributed.
Please see exercise f for a final note regarding the biases present in our non-randomly
selected sample.
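
As a rough illustration of the test mentioned above, the one-sample t statistic for a hypothesized mean of 5 can be computed by hand. This sketch assumes the name lengths come from an approximately normal population, which a sample of seven cannot verify, and it stops at the test statistic rather than reporting a p-value.

```python
import math
import statistics

# Name lengths of the combined sample (n = 7).
lengths = [7, 5, 6, 7, 5, 5, 4]
mu_0 = 5  # hypothesized population mean

n = len(lengths)
sample_mean = statistics.mean(lengths)
sample_sd = statistics.stdev(lengths)          # divides by n - 1

# t = (sample mean - hypothesized mean) / (s / sqrt(n)), with n - 1 degrees of freedom.
standard_error = sample_sd / math.sqrt(n)
t_statistic = (sample_mean - mu_0) / standard_error

print(f"t = {t_statistic:.2f} with {n - 1} degrees of freedom")
```

For reference, the two-sided 5% critical value of the t distribution with 6 degrees of freedom is about 2.447, so a statistic of this size would not lead to rejecting a mean of 5.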
f) (10 marks) Write a statement that summarizes the information about the presence of the
letter E in the last names of business statistics students.
When examining the presence of the letter E in the last names of our own group
members (n=4), the letter E occurred in 75% of the sample, which suggests that
E is a very common letter in the last names of the population of 154 Business Statistics
students. However, after including the data of a second group (now n=7), we observe
a decrease in the share of last names containing E to 57%. Given that we have more data to
rely on, this latter result is likely to be more accurate. As such, we tentatively infer that
at least half of the population has the letter E in the last name.
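
To give a sense of how uncertain both percentages are, the sketch below computes each sample proportion together with its approximate standard error, sqrt(p(1 - p)/n). This normal-approximation formula is an assumption added here for illustration and is unreliable for samples this small; `proportion_with_se` is a hypothetical helper name.

```python
import math

def proportion_with_se(successes, n):
    """Return the sample proportion and its approximate standard error."""
    p = successes / n
    return p, math.sqrt(p * (1 - p) / n)

p_own, se_own = proportion_with_se(3, 4)     # own group: 3 of 4 last names contain E
p_both, se_both = proportion_with_se(4, 7)   # combined groups: 4 of 7 last names contain E

print(f"n=4: {p_own:.0%} (SE {se_own:.2f})")
print(f"n=7: {p_both:.0%} (SE {se_both:.2f})")
```

Both standard errors are close to 0.2, so neither percentage pins down the population share at all precisely.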
Nevertheless, the above inferences (in exercises e and f) are subject to several biases.
First, the sample was not randomly selected. Instead, we used a convenience sample,
without considering the underlying sample characteristics. As a result, our sample
consisted only of international students (who make up a significantly lower proportion of
the entire population), and all subjects were male. International students are likely to
have very different names from Chinese students, and men typically have different names
from women. This could result in a letter (be it I or E) occurring more often in one of
these sub-groups than in the others, resulting in a biased inference.
Moreover, our sample only consists of 7 subjects, which is a very small sub-group of
the entire population. As we have seen above, adding more data points can have a
strong impact on frequencies, meaning that we likely have too little data to make a proper
inference. Given that the population standard deviation (σ) is unknown, we can
calculate a suggested sample size using the Yamane formula:

\[ n_Y = \frac{N}{1 + N e^2} \]
where N is the population size and e is the margin of error (level of precision). With a
margin of error of 0.05, corresponding to a 95% confidence level, we get the following:
\[ n_Y = \frac{154}{1 + 154 \times 0.05^2} = \frac{154}{1.385} \approx 111 \]
The formula therefore suggests a sample of around 111 students. As a rule of thumb,
however, a minimum sample size of 30 is needed for an approximately normal
sampling distribution.
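
The calculation above can be checked with a few lines of Python. This is a minimal sketch; `yamane_sample_size` is a hypothetical helper name, and `e` is the value 0.05 used in the text.

```python
def yamane_sample_size(population, e):
    """Yamane's formula: n = N / (1 + N * e**2)."""
    return population / (1 + population * e ** 2)

n_y = yamane_sample_size(154, 0.05)
print(round(n_y))
```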

Question 2
(15 marks) Discuss the different types of data – survey, time-series and cross-sectional/panel
data. Write down examples of each of these types. What types of institutions provide this
data?
Survey: refers to data collected by asking a set of questions to a sample of people,
which should be representative of the total population.
- Types of institutions that provide this data: private companies, psychology research
centers, national statistics bureaus.

- Example: the government is considering whether or not to upgrade the
electricity supply system; upgrading the system would cost the taxpayers a lot of
money. To decide whether or not to implement this upgrade, it surveys the
population about their opinion on and need for a new power supply
system, using seven-point Likert-scale items. The questionnaires are distributed
through the postal service.

Time series data: refers to data collected on the same subject at successive points in
time, often over several years.
- Types of institutions that provide this data: universities, research centers, hospitals,
central banks, databases.

- Examples:
• Results of administering a new drug to one patient over a period
of 20 years.
• Stock prices.
• Total export volume of one country.
• Public revenue from tax collection.

Cross-sectional data: refers to data collected from different subjects at one specific point
in time.
- Types of institutions that provide this data: banks, private companies, governments.

- Examples:
• Profit of various companies for the month of July 2018.
• Net promoter score of all Fortune 500 companies as of December 31, 2018.
• GDP of all European countries for the year 2016.
Panel data: refers to data collected from different subjects over a certain period
of time. It has two dimensions: the time dimension and the cross-sectional dimension.

- Types of institutions that provide this data: central banks, UN, governments, think
tanks, IMF.

- Examples:
• Annual value of chemical exports from each of the 28 EU member states to
other member states between 2004 and 2017.
• Foreign debt of South American countries for the last ten years.
• Total daily revenue of four different car companies for the year 2013.
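
The relation between the three data types can be sketched in code: a panel has one value per subject and period, fixing the subject gives a time series, and fixing the period gives a cross-section. The company names and revenue figures below are hypothetical placeholders invented purely for illustration, not real data.

```python
# Hypothetical placeholder values, for illustration only: (company, year) -> revenue.
panel = {
    ("Company A", 2012): 10.0, ("Company A", 2013): 12.0,
    ("Company B", 2012): 7.0,  ("Company B", 2013): 6.5,
    ("Company C", 2012): 3.0,  ("Company C", 2013): 4.0,
}

# Time series: one subject followed over time.
time_series = {year: value for (company, year), value in panel.items()
               if company == "Company A"}

# Cross-section: all subjects at one point in time.
cross_section = {company: value for (company, year), value in panel.items()
                 if year == 2013}

print(time_series)
print(cross_section)
```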

Question 3
(10 marks) In 1974, the Franklin National Bank in the USA failed. It was one of the 20 largest
banks and the largest ever to fail. Could Franklin’s weakened condition have been detected in
advance by simple data analysis? The scatterplot gives the total assets (in billions of dollars)
and net income for the 20 largest banks in 1973. The assets for Franklin were 3.8 and the net
income was 13.8.

Describe the overall pattern of the plot. Are there any banks with unusually high or low
income relative to their assets? Does Franklin stand out from other banks?

Franklin Bank: Looking at the graph, the three banks with the highest net income
stand far from the fourth highest and are located by themselves in the upper right corner.
Furthermore, Franklin Bank's performance was in accordance with the overall relation
between net income and total assets. The data can be well approximated by a
linear regression line that shows a positive relation between net income and assets, and
Franklin Bank does not seem to deviate significantly from that line.
Overall pattern: From the information presented in the scatterplot, we can conclude that
there is a positive relation between net income and total assets. The four largest banks
(according to their net income) are also well approximated by the linear regression line. The
level of net income of the last two banks is unusually high relative to the amount of assets
they possess, setting them apart from the rest. To conclude, Franklin's weakened condition
could not have been detected from this plot, as its data point is well approximated by the
linear regression line.

Appendix:
Sample mean: \[ \bar{x} = \frac{\sum x_i}{n} \]

Sample standard deviation: \[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]
