ENGR 7 W03 HW1 solutions

(total: 100 pts)

1. (L & L) Problem 3.10, page 65 (10 pts total)

(a) (4 pts)

Frequency table: (the number of intervals should be between 5 and 20)

Interval   | Frequency | Relative Frequency
0-300      | 3         | 0.125
301-600    | 11        | 0.4583
601-900    | 4         | 0.1667
901-1200   | 2         | 0.0833
1201-1500  | 1         | 0.0417
1501-1800  | 1         | 0.0417
1801-2100  | 0         | 0
2101-2400  | 0         | 0
2401-2700  | 1         | 0.0417
2701-3000  | 0         | 0
3001-3300  | 0         | 0
3301-3600  | 0         | 0
3601-3900  | 0         | 0
3901-4200  | 0         | 0
4201-4500  | 0         | 0
4501-4800  | 0         | 0
4801-5100  | 1         | 0.0417
Total      | 24        | 1


(b) (2 pts)

Relative frequency histogram:



(c) (2 pts)

The histogram has a long right tail (is skewed to the right). (2 pts)

There is an outlier in the interval 2401-2700, and an extreme outlier in the interval 4801-5100 [not required].

(d) (2 pts)

Since 6 of the 24 cities have city taxes of more than $900, the chance that your city taxes would be more than $900 is 6/24 = 25%.
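A quick numerical check of the relative frequencies in (a) and the proportion in (d) [not required]. This is only a sketch: the counts are copied from the frequency table, and the variable names are illustrative.

```python
# Frequencies copied from the table in part (a); each interval is 300 wide.
counts = [3, 11, 4, 2, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]
upper_bounds = [300 * (i + 1) for i in range(len(counts))]  # 300, 600, ..., 5100

n = sum(counts)                      # total number of cities
rel_freq = [c / n for c in counts]   # relative frequency = count / total

# Cities in intervals that lie entirely above $900 (upper bound > 900).
above_900 = sum(c for c, ub in zip(counts, upper_bounds) if ub > 900)
p_above_900 = above_900 / n

print(n, above_900, p_above_900)
print([round(r, 4) for r in rel_freq])
```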

2. (P & G) Problem 14, page 33, chapter 2 (12 pts total)

(a) (2 pts)

The 1987 group tends to have lower blood lead levels, because the 1987 distribution has more of its data at the lower end of the blood lead level scale.

You can see this more clearly from their histograms (density scale): [not required]




(b) (8 pts)

Table of cumulative frequencies: (4 pts)


Blood Lead Level (ug/dl) | 1979 (%) | 1987 (%)
< 20                     | 11.5     | 37.8
< 29                     | 23.6     | 52.5
< 39                     | 37.5     | 65.6
< 49                     | 52.9     | 80.9
< 59                     | 69.4     | 91.4
< 69                     | 82.2     | 98.2
< 79                     | 90.6     | 99.6
>= 80                    | 100      | 100

Cumulative frequency polygons: (4 pts)


(c) (2 pts)

The cumulative frequency polygon for the 1979 group (in red) lies to the right of that for the 1987 group (in blue), so the distribution of blood lead levels for the 1979 group is stochastically larger.
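The "lies to the right" statement is equivalent to saying that, at every cutoff, the 1979 cumulative percentage is no larger than the 1987 one. A minimal check of that condition, using the values from the cumulative table in (b) [not required]; the variable names are illustrative.

```python
# Cumulative percentages copied from the table in part (b).
cutoffs = ["< 20", "< 29", "< 39", "< 49", "< 59", "< 69", "< 79", ">= 80"]
cum_1979 = [11.5, 23.6, 37.5, 52.9, 69.4, 82.2, 90.6, 100.0]
cum_1987 = [37.8, 52.5, 65.6, 80.9, 91.4, 98.2, 99.6, 100.0]

# 1979 is stochastically larger if its cumulative curve is never above 1987's.
stochastically_larger_1979 = all(a <= b for a, b in zip(cum_1979, cum_1987))

# Size of the gap between the two curves at each cutoff.
gaps = [round(b - a, 1) for a, b in zip(cum_1979, cum_1987)]

print(stochastically_larger_1979)
for label, gap in zip(cutoffs, gaps):
    print(label, gap)
```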

3. (L & L) Problem 3.34, page 80 (6 pts total)

(a) (2 pts)

Median = 154, Mean = 173.7.

Mode = 300 [not required]

(b) (2 pts)

Since we do not know the exact failure times for some of the engines and simply truncated those times to 300, our mean is actually less than the true mean. Our median, however, is the same as the true median (think about why). There is no definite relationship between our mode and the true mode, though [not required].
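A small illustration of the effect of truncation on the mean and the median [not required]. The failure times below are hypothetical (they are not the data from the problem); they are chosen so that fewer than half of the values exceed 300.

```python
import statistics

# Hypothetical "true" failure times; NOT the data from Problem 3.34.
true_times = [60, 110, 150, 158, 320, 450, 700]

# Observed times: anything beyond 300 is recorded as 300.
observed_times = [min(t, 300) for t in true_times]

true_mean = statistics.mean(true_times)
observed_mean = statistics.mean(observed_times)
true_median = statistics.median(true_times)
observed_median = statistics.median(observed_times)

# Truncation lowers the large values, so the mean drops; the middle value
# is below 300 here, so the median is untouched.
print(true_mean, observed_mean)
print(true_median, observed_median)
```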

(c) (2 pts)

Any reasonable example given by the student will do.

4. (P & G) Problem 7, page 60, chapter 3 (10 pts total)

(a) (4 pts)

Calcium levels (in mmol/l):

mean = 3.14, median = 3.08, SD = 0.51, range = 1.47

(b) (4 pts)

Albumin levels (in g/l):

mean = 40, median = 42, SD = 3, range = 9

(c) (2 pts)

(For healthy individuals, the normal range of calcium levels is 2.12 to 2.74 mmol/l, while the normal range of albumin levels is 32 to 55 g/l.)

For patients with vitamin D intoxication:

The majority of their calcium levels are higher than the normal range, but all their albumin levels are well within the normal range. So they have abnormal (higher than normal) calcium levels but normal albumin levels.

5. (P & G) Problem 8, pages 60-61, chapter 3 (6 pts total)

(a) (2 pts)

(Please refer to page 41 of P & G for calculation of median)

Median for bulimic: 21.6

Median for healthy: 30.6

(b) (2 pts)

(Please refer to page 44 of P & G for calculation of percentiles)

For bulimic:

25th percentile = 18.1, 75th percentile = 25.2, IQR (interquartile range) = 25.2 - 18.1 = 7.1

For healthy:

25th percentile = 23.8, 75th percentile = 36.6, IQR (interquartile range) = 36.6 - 23.8 = 12.8

(c) (2 pts)

A healthy individual typically eats more than a bulimic individual (by looking at their medians) (students can also compare their means if they want)

The healthy group has a greater amount of variability than the bulimic group (by looking at their IQRs) (students can also compare their SDs if they want)

6. (P & G) Problem 9, page 61, chapter 3 (7 pts total)

(a) (3 pts)

Europe has the smallest mean, because all data points lie between 0 and 50 and the mean is less than 25;

Africa has the largest median, because the data spread out from 0 to 200 and the median is somewhere close to 100;

Europe has the smallest standard deviation, because the spread (variability) of data is smallest.

(b) (4 pts)

For Africa, we would expect the mean and the median to be approximately equal, because the distribution is roughly symmetric and there are no extreme values.

But for Asia, the answer is no: its mean and median are quite different, because the distribution is skewed to the right and there are extreme observations at the high end, which pull the mean up.

7. (L & L) Problem 3.50, page 100 (6 pts total)

Boxplot: (4 pts)


Describe the shape of the distribution: roughly symmetric, no outliers. (2 pts)

8. (P & G) Problem 10, page 156, chapter 6 (10 pts total)

(a) (2 pts)

P(private insurance) = 0.387

(b) (4 pts)

P(govt. program) = P(Medicare OR Medicaid OR Other govt. program) = P(Medicare) + P(Medicaid) + P(Other govt. program) = 0.345 + 0.116 + 0.033 = 0.494

(c) (4 pts)

P(Medicare | govt. program) = P(Medicare AND govt. program) / P(govt. program) = P(Medicare) / P(govt. program) = 0.345 / 0.494 = 0.698
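The addition rule in (b) and the conditional probability in (c) can be checked as follows [not required]. The probabilities are the ones quoted above; the dictionary keys are just illustrative labels.

```python
# Probabilities quoted in parts (a)-(c).
p = {
    "private": 0.387,
    "medicare": 0.345,
    "medicaid": 0.116,
    "other_govt": 0.033,
}

# (b) The three government categories are mutually exclusive, so add them.
p_govt = p["medicare"] + p["medicaid"] + p["other_govt"]

# (c) Medicare is a subset of "govt. program", so
# P(Medicare AND govt.) = P(Medicare).
p_medicare_given_govt = p["medicare"] / p_govt

print(round(p_govt, 3), round(p_medicare_given_govt, 3))
```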

9. (P & G) Problem 11, page 157, chapter 6 (6 pts total)

For any randomly selected adult between the ages of 45 and 64:

P(uninsured) = 0.123, P(insured) = 1 - 0.123 = 0.877

(a) (2 pts)

P(a woman uninsured AND a man uninsured) = P(a woman uninsured) * P(a man uninsured) = 0.123 * 0.123 = 0.015

Note: the first equal sign results from the independence of individuals. Same reasoning for (b) and (c).

(b) (2 pts)

P(a woman insured AND a man insured) = P(a woman insured) * P(a man insured) = 0.877 * 0.877 = 0.769

(c) (2 pts)

P(all five adults uninsured) = P(1st adult uninsured) * P(2nd adult uninsured) * ... * P(5th adult uninsured) = 0.123 ^ 5 = 2.8 * 10 ^ (-5) = 0.000028
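The three products in (a)-(c) can be reproduced in a few lines [not required]. This is a sketch that relies on the same independence assumption stated in the note above.

```python
p_uninsured = 0.123
p_insured = 1 - p_uninsured

# Independence lets us multiply the individual probabilities.
p_both_uninsured = p_uninsured ** 2   # part (a)
p_both_insured = p_insured ** 2       # part (b)
p_five_uninsured = p_uninsured ** 5   # part (c)

print(round(p_both_uninsured, 3))
print(round(p_both_insured, 3))
print(f"{p_five_uninsured:.1e}")
```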

10. (Taken from Elementary Statistics by Larson and Farber, 2000) (6 pts total)

Chebychev's inequality: For any number k that is greater than or equal to 1, at least [ 1 - (1 / k) ^ 2 ] of the measurements in the set of data lie within k standard deviations of their mean.

For k = 3, 1 - (1 / 3) ^ 2 = 1 - 1/9 = 8/9 = 88.9% (2 pts)

In this case, mean = 3.32 minutes, SD = 1.09 minutes. The interval that is within 3 SD from mean is:

(3.32 - 3 * 1.09, 3.32 + 3 * 1.09) = (0.05, 6.59) (2 pts)

Real world interpretation: At least 8 times out of 9 (or about 89 times out of 100, i.e. at least 89% of the time), the duration of Old Faithful's eruption will be between 0.05 minutes and 6.59 minutes. (2 pts)
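The bound and the interval can be verified with a short sketch [not required]. The function names are illustrative; the mean and SD are the ones given in the problem.

```python
def chebyshev_bound(k):
    """Minimum fraction of data within k SDs of the mean (requires k >= 1)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1 - (1 / k) ** 2


def k_sd_interval(mean, sd, k):
    """Interval from k SDs below the mean to k SDs above it."""
    return (mean - k * sd, mean + k * sd)


bound = chebyshev_bound(3)
low, high = k_sd_interval(3.32, 1.09, 3)

print(round(bound, 3), round(low, 2), round(high, 2))
```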

11. (L & L) Problem 4.18, page 135 (11 pts total)

A = the event that the response comes from site 1;

B = the event that the response is poor;

(a) (3 pts)

P(A) = number of responses from site 1 / total number of responses = 192 / (192 + 248) = 0.436

P(B) = number of poor responses / total number of responses = (48 + 80) / (192 + 248) = 0.291

P(A AND B) = number of poor responses from site 1 / total number of responses = 48 / (192 + 248) = 0.109

(b) (2 pts)

P(A) * P(B) = 0.436 * 0.291 = 0.127

P(A AND B) = 0.109

Since P(A AND B) is not equal to P(A) * P(B), events A and B are not independent.

(c) (6 pts)

P(B | A): (2 pts)

One way to do it: P(B | A) = P(A AND B) / P(A) = number of poor responses from site 1 / total number of responses from site 1 = 48 / 192 = 0.25

Another way to do it: P(A AND B) / P(A) = 0.109 / 0.436 = 0.25

P(B | A^C): (2 pts)

A^C = the complement of A, i.e. the event that the response comes from site 2.

One way to do it: P(B | A^C) = P(A^C AND B) / P(A^C) = number of poor responses from site 2 / total number of responses from site 2 = 80 / 248 = 0.323

Another way to do it:

P(A^C) = 1 - P(A) = 1 - 0.436 = 0.564,

P(A^C AND B) = number of poor responses from site 2 / total number of responses = 80 / (192 + 248) = 0.182,

P(B | A^C) = P(A^C AND B) / P(A^C) = 0.182 / 0.564 = 0.323

Clearly P(B | A) and P(B | A^C) are not equal. (2 pts)
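All of the probabilities in (a)-(c) follow from the four counts used above, so they can be recomputed together [not required]. This is a sketch; the variable names are illustrative, and "not A" stands for the complement of A (site 2).

```python
import math

# Counts used in the solution above.
total_by_site = {1: 192, 2: 248}
poor_by_site = {1: 48, 2: 80}

n = sum(total_by_site.values())

p_a = total_by_site[1] / n              # P(A): response from site 1
p_b = sum(poor_by_site.values()) / n    # P(B): response is poor
p_a_and_b = poor_by_site[1] / n         # P(A AND B)

# Independence would require P(A AND B) = P(A) * P(B).
independent = math.isclose(p_a_and_b, p_a * p_b)

# Conditional probabilities: restrict attention to one site.
p_b_given_a = poor_by_site[1] / total_by_site[1]
p_b_given_not_a = poor_by_site[2] / total_by_site[2]

print(round(p_a, 3), round(p_b, 3), round(p_a_and_b, 3), independent)
print(round(p_b_given_a, 3), round(p_b_given_not_a, 3))
```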

12. (Taken from Rosner, page 135) (10 pts total)

If events E1, E2, ..., En are mutually exclusive and exhaustive, i.e. P(Ei and Ej) = 0 (i, j different) and P(E1 or E2 or ... or En) = 1, then for any event H:

H = (H and E1) or (H and E2) or ... or (H and En)

P(H) = P(H and E1) + P(H and E2) + ... + P(H and En) = P(E1) * P(H | E1) + P(E2) * P(H | E2) + ... + P(En) * P(H | En)

For this problem:

H = have a low birthweight infant

E1 = length of gestation < 20 weeks

E2 = length of gestation is 20-27 weeks

E3 = length of gestation is 28-36 weeks

E4 = length of gestation > 36 weeks

P(H) = P(E1) * P(H | E1) + P(E2) * P(H | E2) + P(E3) * P(H | E3) + P(E4) * P(H | E4)

= (0.0004 * 0.540) + (0.0059 * 0.813) + (0.0855 * 0.379) + (0.9082 * 0.035)

= 0.0002 + 0.0048 + 0.0324 + 0.0318

= 0.0692 = 6.92%

So the probability of having a low birthweight infant is 6.92%.
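The total probability sum can be checked term by term [not required]. A minimal sketch using the values from the calculation above; the list names are illustrative.

```python
# P(Ei) and P(H | Ei) for the four gestation-length groups, in order E1..E4.
p_e = [0.0004, 0.0059, 0.0855, 0.9082]
p_h_given_e = [0.540, 0.813, 0.379, 0.035]

# The groups must be exhaustive: their probabilities sum to 1.
assert abs(sum(p_e) - 1.0) < 1e-9

# Law of total probability: P(H) = sum of P(Ei) * P(H | Ei).
terms = [pe * ph for pe, ph in zip(p_e, p_h_given_e)]
p_h = sum(terms)

print([round(t, 4) for t in terms])
print(round(p_h, 4))
```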



**********************************************************************************************

**********************************************************************************************



Due to typos on hw1 problem list, some students did different problems for 8 and 9. They will not be penalized.

Here are the solutions to these problems:

8. (P & G) Problem 10, page 62, chapter 3 (6 pts total)

(a) (4 pts)

Use midpoints of those intervals for calculations of grouped mean and grouped SD.


          | Grouped mean | Grouped SD
Smoker    | 199          | 111
Nonsmoker | 29           | 34
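For reference, the midpoint method can be sketched as below [not required]. The midpoints and frequencies in the example are hypothetical, not the cotinine data from the problem, and the sketch assumes the SD is computed with the n - 1 divisor.

```python
import math


def grouped_mean_sd(midpoints, freqs):
    """Grouped mean and SD, treating every value as its interval midpoint.

    Assumes the sample (n - 1) divisor for the variance.
    """
    n = sum(freqs)
    mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
    sum_sq = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, freqs))
    return mean, math.sqrt(sum_sq / (n - 1))


# Hypothetical intervals 0-10, 10-20, 20-30 with made-up frequencies.
example_mean, example_sd = grouped_mean_sd([5, 15, 25], [2, 5, 3])
print(example_mean, round(example_sd, 2))
```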

(b) (1 pts)

Median for smokers falls in the interval 200-249 ng/ml;

Median for nonsmokers falls in the interval 0-13 ng/ml.

(c) (1 pts)

Compared to nonsmokers, smokers have much higher cotinine levels and larger SDs in the distribution.

9. (P & G) Problem 11, page 62, chapter 3 (10 pts total)

(a) (3 pts)

mean = 87.94, median = 86.00, (1 pts)

SD = 16.00, range = 103, (1 pts)

25th percentile: nk/100 = 462*25/100 = 115.5 (round up) = 116, x[116] = 76

75th percentile: nk/100 = 462*75/100 = 346.5 (round up) = 347, x[347] = 98

so interquartile range = 98 - 76 = 22 (1 pts)
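The position rule used above (compute nk/100 and round up when it is not an integer) can be sketched as follows [not required]. The data list is hypothetical, and the handling of the case where nk/100 is an integer (averaging that observation with the next one) is an assumption about the textbook convention, since that case does not arise here.

```python
def rounded_up_position(n, k):
    """1-based position ceil(n * k / 100), using integer arithmetic."""
    return -(-n * k // 100)


def percentile(sorted_values, k):
    """k-th percentile (integer k, 0 < k < 100) of an ascending list."""
    n = len(sorted_values)
    q, r = divmod(n * k, 100)        # n * k / 100 = q + r / 100
    if r:                            # not an integer: round up to position q + 1
        return sorted_values[q]      # 1-based position q + 1 is index q
    # Integer case (assumed convention): average positions q and q + 1.
    return (sorted_values[q - 1] + sorted_values[q]) / 2


# Positions for the n = 462 measurements in this problem.
print(rounded_up_position(462, 25), rounded_up_position(462, 75))

# Hypothetical data to exercise both branches.
data = list(range(1, 11))
print(percentile(data, 25), percentile(data, 50))
```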

(b) (2 pts)

Chebychev's inequality says: at least (1 - (1/k)^2 ) of the measurements lie within k SDs of the mean.

With k = 2, 2 SDs either way from the mean is the interval 87.94 +/- 2*16 = [55.94, 119.94], while (1 - (1/2)^2) = .75, so at least 75% of the measurements will be within the interval [55.94, 119.94]. (1 pts)

With k = 3, 3 SDs either way from the mean is the interval 87.94 +/- 3*16 = [39.94, 135.94], while (1 - (1/3)^2) = .89, so at least 89% of the measurements will be within the interval [39.94, 135.94]. (1 pts)

(c) (3 pts)

According to the empirical rule, you would expect:

Approximately 95% of the measurements lie within 2 SDs of the mean; almost all of the measurements lie within 3 SDs of the mean. (1 pts)

The facts about this data set:

98% of the 462 measurements lie within 2 SDs of the mean; 99% of the 462 measurements lie within 3 SDs of the mean. (2 pts)
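The three sets of numbers can be put side by side [not required]. A sketch only: the observed proportions are the ones quoted above, and 99.7% is used as the usual numerical value for the empirical rule's "almost all" at 3 SDs.

```python
def chebyshev_lower_bound(k):
    """Chebychev's guaranteed minimum fraction within k SDs of the mean."""
    return 1 - (1 / k) ** 2


empirical_rule = {2: 0.95, 3: 0.997}   # approximate, for bell-shaped data
observed = {2: 0.98, 3: 0.99}          # proportions quoted for this data set

for k in (2, 3):
    cheb = chebyshev_lower_bound(k)
    print(k, round(cheb, 3), empirical_rule[k], observed[k])

# Distance of each rule's figure from what was actually observed.
cheb_gap = {k: abs(observed[k] - chebyshev_lower_bound(k)) for k in (2, 3)}
emp_gap = {k: abs(observed[k] - empirical_rule[k]) for k in (2, 3)}
```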

(d) (2 pts)

From (b) and (c) we see that the empirical rule does a better job of summarizing these serum zinc levels than Chebychev's inequality. (1 pts)

Why? Because: (1 pts)

Chebychev's inequality is a more conservative statement and is true for any data set regardless of the shape of the distribution, thus is less specific;

The empirical rule is more specific and is a good approximation when the distribution is unimodal and roughly symmetric.

In this case the distribution of the data is not terribly far from being symmetric, so the empirical rule can be applied here, and it gives us a more precise and better summary of the data.