STATISTICS
The importance of distributions
a. Allow social scientists to organize data that might not have
been interpretable without that organization.
b. Distribution means the arrangement of data in order of magnitude
i. Frequency distribution allows the researcher to see more
general trends than he or she could with an unordered set of data
ii. Frequency distributions are also significantly more compact
and therefore take up less space when we write them up
iii. Frequency distribution looks like the following:
        x    f
       10    1
        8    1
        7    3
        6    3
        5    2
        4    1
        3    1
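The tallying behind a frequency distribution can be sketched in a few lines of Python. The raw scores below are hypothetical, chosen so that they reproduce the x and f columns of the table above.

```python
from collections import Counter

# Hypothetical raw scores matching the frequency table above:
# each x value repeated f times.
scores = [10, 8, 7, 7, 7, 6, 6, 6, 5, 5, 4, 3]

# Tally how often each score occurs, then print in order of magnitude
# (largest first, as in the table).
freq = Counter(scores)
for x in sorted(freq, reverse=True):
    print(f"{x:>3}  {freq[x]}")
```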
c. The easiest way to examine frequencies is through the use of
graphing. There are two types of graphs that I would recommend:
i. The histogram – here a bar is drawn for each raw score, with
the height of the bar representing the frequency the score occurs
ii. The asterisk graph – here the raw score is presented on the
left of the graph and one asterisk is placed in the second column
for each occurrence of the value (these asterisks go across
horizontally). These graphs allow us to see visually how our data
fall in relation to all of the values.
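The asterisk graph described above is simple enough to produce directly. This sketch reuses the hypothetical raw scores from the frequency table: the raw score sits on the left, and one asterisk is printed per occurrence, running horizontally.

```python
from collections import Counter

# Hypothetical raw scores (assumed for illustration).
scores = [10, 8, 7, 7, 7, 6, 6, 6, 5, 5, 4, 3]
freq = Counter(scores)

# Asterisk graph: raw score on the left, one asterisk per occurrence,
# running horizontally across the row.
rows = [f"{x:>3} | {'*' * freq[x]}" for x in sorted(freq, reverse=True)]
print("\n".join(rows))
```

The tallest bar of a histogram and the longest row of asterisks carry the same information: both point at the most frequent score.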
Measures of central tendency
d. These are measures designed to give information concerning the
average or typical score within a larger set of scores. For example
suppose we want to test the average IQ for prisoners housed in the
disciplinary unit at Parchman. We would use one of the soon to be
discussed methods of examining the average or typical inmate’s IQ.
e. There are 3 measures of central tendency that we will talk about
(Mean, Median, and Mode).
i. Mean – The mean is the most commonly used measure of central
tendency because it is the best mathematical calculation of the
average. It is denoted as either capital "M" or as "X̄",
also known as X-bar.
1. The mean is calculated using the following equation:
X̄ = ΣX / N. This equation is read as the mean is
equal to the sum of all X values (raw scores) divided by the
number of cases in the study. The use of capital letters versus
lower-case is important and should be considered because a
lowercase "n" would stand for only the subjects in one group,
where capital "N" stands for all subjects in a study. Here there
is no difference but in later sections there will be a
difference between the two.
2. The mean is the most often used but can be susceptible to
extreme scores termed "outliers", or scores that are extremely
unusual and outside the normal range of scores in the study.
3. Interpretation of the mean must be conducted carefully as
the difference between groups could impact the difference
between mean scores.
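A minimal sketch of the mean formula X̄ = ΣX / N, using the hypothetical raw scores from the frequency table earlier:

```python
# Mean: sum of all X values (raw scores) divided by the number of cases.
scores = [10, 8, 7, 7, 7, 6, 6, 6, 5, 5, 4, 3]  # hypothetical raw scores
N = len(scores)            # capital N: all subjects in the study
mean = sum(scores) / N
print(round(mean, 2))
```

Adding a single extreme outlier (say, a score of 100) to this list would pull the mean upward noticeably, which is exactly the susceptibility the notes warn about.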
ii. The Median – this is the exact midpoint of any distribution
and is where 50% of the cases will fall above and 50% will fall
below. The median is used less often than the mean because it is not
computed mathematically from all of the scores. The median is denoted
as "Mdn".
1. To calculate the median, one has to arrange the data in
distribution form in the order of magnitude. Then count up to
the middle point in the distribution.
2. If there are an odd number of cases in the study, for
example 9, then the median would be the 5th value.
3. If there is an even number of cases in the study, for
example 8, then the median is computed by adding the 4th
and 5th scores and dividing the sum by 2.
4. This measure is less susceptible to the presence of
outliers and extreme scores.
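The median steps above can be sketched directly: order the data, count to the middle, and average the two middle scores when the number of cases is even. The example score lists are assumptions for illustration.

```python
def median(values):
    """Median by hand: arrange the data in order of magnitude,
    then count up to the middle point."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                          # odd n: the exact middle score
    return (ordered[mid - 1] + ordered[mid]) / 2     # even n: average the two middle scores

print(median([3, 1, 7, 5, 9, 2, 8, 6, 4]))   # 9 cases: the 5th ordered value
print(median([1, 2, 3, 4, 5, 6, 7, 8]))      # 8 cases: average of the 4th and 5th
```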
iii. The Mode – this is the most commonly occurring score in a
distribution and is symbolized as "Mo".
1. If we have a frequency distribution developed, then we
merely have to look for the score with the highest "f" value. If
we have a histogram or asterisks graph then we merely have to
look for the score under the tallest bar or the longest
asterisk.
2. If we do not have a graph or a frequency distribution,
then we have to order the data and then attempt to see which
occurs most frequently.
3. Occasionally, you may find a set of scores that has more
than one mode. In this case you have what is known as a bimodal
distribution if there are 2, or a multimodal distribution if
there are more than 2. Here, you may find that two separate
groups have been classified together. For example, in a study
concerning the number of hours inmates spend with family members
during visitation, you may have accidentally grouped minimum
security inmates (who receive less restricted access) and maximum
security inmates (whose access is limited) together in the same
study.
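Finding the mode, including the bimodal case, amounts to looking for the highest "f" value in the tally. A sketch with hypothetical score sets:

```python
from collections import Counter

def modes(values):
    """Return every score tied for the highest frequency, so bimodal
    and multimodal sets come back with more than one value."""
    freq = Counter(values)
    top = max(freq.values())
    return sorted(x for x, f in freq.items() if f == top)

print(modes([7, 7, 7, 6, 6, 5]))      # one mode
print(modes([2, 2, 2, 9, 9, 9, 5]))   # bimodal: two separate groups mixed together
```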
Why do we need to understand all 3 measures? Because occasionally we
may encounter a distribution that is skewed.
f. The normal distribution – this is the perfect distribution of
scores. There are an equal number of cases falling both above and below
the middle point of the curve. In this situation, the mean, median, and
mode all fall at the same point.
g. Skewed distributions – this occurs when extreme scores pull the
mean away from the middle point of the distribution.
i. Negative skew – this is where the tail of the curve slopes to
the left of the distribution. It is characterized by the mean of the
data set’s scores being less than the median and the mode. In fact,
the mean will be the smallest value, then will come the median
(which is the center point) and then will come the mode (which is
characterized by the large hump on the curve).
ii. Positive skew – this is where the tail of the curve slopes to
the right of the distribution. It is characterized by the mean of
the data set’s scores being greater than the median and the mode.
The mean is in fact the largest value, followed by the median (which
is still the center of the distribution), and then the smallest
number will be the mode (characterized by the large hump on the
curve).
Measures of Variability
h. Just as the measures of central tendency allow us to examine the
average of the scores, the measures of variability allow us to examine
how much our scores vary. This variance is to be expected since we will
rarely (hopefully never) encounter a study where everyone scores the
same on a test or a subject of study. These measures then allow us to
gain a better understanding of the true nature of the data sets.
Consider the weather example given in class.
i. There are 3 measures of variability to discuss.
i. Range – this is the quickest and easiest method of calculating
variability. It is also the least recommended because it is
heavily influenced by extreme scores.
1. The range is calculated by subtracting the smallest score
from the largest score in the distribution.
2. For example, if we are examining the number of
arrests and we have someone who has been previously arrested 30
times and we have someone who has been arrested 2 times, then
our range is 28.
3. The range can also drastically change. It can never go
down but it can continue to go up. If a new subject is added
that had 40 arrests, then our range is now 38.
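The arrest example above can be sketched in a couple of lines; the arrest counts are hypothetical, matching the figures in the notes.

```python
# Range = largest score minus smallest score.
arrests = [30, 2, 14, 9]                  # hypothetical prior-arrest counts
range_before = max(arrests) - min(arrests)
print(range_before)                       # 30 - 2

# Adding a new extreme case can raise the range but never lower it.
arrests.append(40)
range_after = max(arrests) - min(arrests)
print(range_after)                        # 40 - 2
```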
ii. Standard Deviation – this is considered the best measure
of variability. Unlike the value for the range, the standard
deviation takes into consideration all of the scores in a dataset.
The standard deviation allows us to examine how much our scores vary
around the mean. Because the mean is being considered in the
standard deviation, the data used must be interval or ratio in
nature. The standard deviation is denoted using one of two symbols
"SD" or "s". There are two methods of computing the standard
deviation.
1. The deviation method. Not considered the easiest but can
be covered if there is an interest in this method.
2. The computational method. Here the standard deviation is
calculated using the following formula:
SD = √[(ΣX² - (ΣX)²/N) / N]
3. The use of the standard deviation allows us to determine
exactly how much the scores vary around the mean. Suppose we
develop two treatment programs that reduce the number of arrests
a juvenile will suffer in their life. If we test program one and
we get a SD of 2, and we test program 2 and get a SD of 8, then
which program should we push to the administration?
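The computational method can be sketched as follows. Note this divides by N, treating the data as the whole set of subjects, which matches the formula in these notes; sample-based work often divides by n - 1 instead. The example scores are hypothetical.

```python
import math

def sd_computational(scores):
    """Computational method: SD = sqrt((sum of X^2 - (sum of X)^2 / N) / N),
    dividing by N as the notes' formula does."""
    N = len(scores)
    sum_x = sum(scores)
    sum_x2 = sum(x * x for x in scores)
    return math.sqrt((sum_x2 - sum_x ** 2 / N) / N)

print(sd_computational([2, 4, 4, 4, 5, 5, 7, 9]))
```

For the treatment-program question in the notes, the program with SD of 2 produces far more consistent outcomes around its mean than the program with SD of 8, which is the argument for pushing it to the administration.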
iii. The Variance – this measure of variability is closely
related to the standard deviation. In fact, it is merely the
standard deviation squared, or it is the above formula without
taking the square root. We use two measures for the same concept
because later on with advanced statistics we will rely on the
measure of variance.
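Computing both measures from the same hypothetical scores shows the relationship: the variance is simply the standard deviation squared.

```python
import math

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical scores
N = len(scores)
mean = sum(scores) / N

# Variance: average squared deviation from the mean (dividing by N).
variance = sum((x - mean) ** 2 for x in scores) / N
sd = math.sqrt(variance)            # SD is the square root of the variance
print(variance, sd)
```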
Z scores and the Normal Curve
j. The Normal Curve – this is the perfect bell-shaped curve where 50%
of our cases fall below the mean and 50% of our cases fall above the
mean. There are some common characteristics of the normal curve: the two
halves are identical to each other and the measures of central tendency
will all occur at the same point (in the very center).
i. We can standardize the normal curve and we will find that the
mean will be 0 and the standard deviation units will be marked off
in increments of 1. This is necessary because occasionally we may
want to compare two separate groups. Without standardization we
cannot because we would be comparing "apples and oranges".
ii. Still, 50% of the cases fall above and 50% fall below. So we
can state the following: 34.13% of all cases fall between 0 and +1
and 34.13% of all cases fall between 0 and -1. Further 13.59% of all
cases will fall between +1 and +2 and 13.59% of cases will fall
between -1 and -2. Finally, 2.15% of cases will fall between +2 and
+3 and 2.15% of cases will fall between -2 and -3.
iii. Now we can combine this information and state that a little
more than 68% of all cases will fall somewhere between +1 and -1
standard deviation units from the mean of the distribution.
iv. Therefore, if we have a sample of test scores with a mean of
75 and a standard deviation of 5, then we can state that 68% of our
scores will fall between the score of 70 and 80. 95% of our scores
will fall between 65 and 85. And 99% of our scores will fall between
60 and 90. The remaining 1% will fall either below 60 or above 90.
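The 68 / 95 / 99 bands for the test-score example (mean 75, standard deviation 5) can be sketched as:

```python
# Bands at 1, 2, and 3 standard deviation units from the mean.
mean, sd = 75, 5
bands = {pct: (mean - k * sd, mean + k * sd)
         for k, pct in [(1, 68), (2, 95), (3, 99)]}
for pct, (low, high) in bands.items():
    print(f"about {pct}% of scores fall between {low} and {high}")
```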
k. Z scores – this value assists us in interpreting raw score
performance since it takes into account the mean and the measures of
variability (the standard deviation). The z score allows us to examine
an individual’s score versus that of the rest of the distribution. We
can also compare the individual’s performance on two separate normally
distributed sets of scores.
i. Easy to calculate and then we merely refer to the table in the
back of the book (Table III – Normal Distribution – p. 564).
Positive z scores indicate that the raw score falls above the mean
of the distribution, and negative z scores indicate that the raw
score falls below the mean.
ii. Calculated by using the following formula:
z = (X - X̄) / SD
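The z-score formula is a one-liner; this sketch applies it to the earlier test-score example (mean 75, SD 5).

```python
def z_score(x, mean, sd):
    """z = (X - mean) / SD: how many standard deviation units
    a raw score sits from the mean."""
    return (x - mean) / sd

# With the earlier test-score example (mean 75, SD 5):
print(z_score(85, 75, 5))    # two SD units above the mean
print(z_score(70, 75, 5))    # one SD unit below the mean
```

Because z scores are in standard deviation units, they also let us compare an individual's standing across two normally distributed sets of scores, as the notes describe.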