class: title-slide, center, middle
background-image: url(images/theridges.jpg_large)
background-size: cover

# .large[.fancy[Statistical Tests and Basic Modeling]]

## .fancy[Professor Ruhil]

---

class: inverse, middle

# .salt[.fancy[Agenda]]

- Review of `hypothesis` testing
- Overview of some key statistical tests
    - `t-tests`
        - one-sample
        - two-sample (aka independent samples)
        - paired
    - `\(\chi^2\)` (chi-square)
- Overview of some basic `regression models`
    - linear regression `\((y \text{ is continuous})\)`
    - logistic regression `\((y \text{ is binary})\)`

---

class: inverse, middle

# .salt[.fancy[Hypothesis Testing]]

---

**`Hypothesis testing`** is an inferential procedure that uses sample data to evaluate the credibility of a hypothesis about a population parameter. The process involves ...

1. Stating a **`hypothesis:`** an assumption that can neither be fully proven nor fully disproven. For example,
    * Not more than 5% of GM trucks break down in under 10,000 miles
    * Heights of North American adult males are distributed with `\(\mu = 72\)` inches
    * Mean year-round temperature in Athens (OH) is `\(> 62\)`
    * 10% of Ohio teachers are Accomplished
    * Mean county unemployment rate is 12.1%
2. Drawing a sample to test the hypothesis
3. Conducting the test itself to see if the hypothesis should be rejected

---

#### The **`Null`** and the **`Alternative`** Hypotheses

- Null Hypothesis: `\((H_{0})\)` is the assumption believed to be true
- Alternative Hypothesis: `\((H_{1})\)` is the statement believed to be true if `\((H_{0})\)` is rejected

`\(H_{0}\)`: `\(\mu > 72\)` inches; `\(H_{1}\)`: `\(\mu \leq 72\)`

`\(H_{0}\)`: `\(\mu < 72\)` inches; `\(H_{1}\)`: `\(\mu \geq 72\)`

`\(H_{0}\)`: `\(\mu \leq 72\)` inches; `\(H_{1}\)`: `\(\mu > 72\)`

`\(H_{0}\)`: `\(\mu \geq 72\)` inches; `\(H_{1}\)`: `\(\mu < 72\)`

`\(H_{0}\)`: `\(\mu = 72\)` inches; `\(H_{1}\)`: `\(\mu \neq 72\)`

`\(H_{0}\)` and `\(H_{1}\)` are `mutually exclusive` and `mutually exhaustive`

* Mutually Exclusive: Either `\(H_{0}\)` or `\(H_{1}\)` is True `\(\cdots\)` both cannot be true at the same time
* Mutually Exhaustive: `\(H_{0}\)` and `\(H_{1}\)` exhaust the Sample Space `\(\cdots\)` there are no other possibilities unknown to us

---

### `Type I and Type II Errors`

<table class="table table-hover" style="font-size: 14px; width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align: center; font-weight: bold;">Decision based on Sample</th>
<th style="text-align: center; font-weight: bold;">Null is true</th>
<th style="text-align: center; font-weight: bold;">Null is false</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> Reject the Null </td>
<td style="text-align:left;"> Type I error </td>
<td style="text-align:left;"> No error </td>
</tr>
<tr>
<td style="text-align:left;"> Do not reject the Null </td>
<td style="text-align:left;"> No error </td>
<td style="text-align:left;"> Type II error </td>
</tr>
</tbody>
</table>

.pull-left[
`Type I Error:` Rejecting the Null hypothesis `\(H_0\)` when `\(H_0\)` is true, i.e., we should not have rejected the Null
]

.pull-right[
`Type II Error:` Failing to reject the Null hypothesis `\(H_0\)` when `\(H_0\)` is false, i.e., we should have rejected the Null
]

The probability of committing a Type I error `\(= \text{ Level of Significance } = \alpha\)`

The probability of committing a Type II error `\(= \beta\)`

The power of the test is `\((1 - \beta)\)`

We have to decide how often we are willing to make a Type I error (i.e., falsely Reject `\(H_0\)`). Conventionally we set this rate to one of the following `\(\alpha\)` values:

`$$\alpha=0.05 \text{ or } \alpha=0.01$$`

Note the very cautious language ... `Reject` `\(H_{0}\)` versus `Do Not Reject` `\(H_{0}\)`

---

### The Process of Hypothesis Testing: One Example

Assume we want to know whether the roundabout on SR682 has had an impact on traffic accidents in Athens. We have historical data on the number of accidents in years past. Say the average per day used to be 6 (i.e., `\(\mu_0 = 6\)`).

To see if the roundabout has had an impact we could gather accident data for a random sample of 100 days `\((n=100)\)` from the period after the roundabout was built. Before we do that, though, we will need to specify our hypotheses. What do we think might be the impact? Let us say the City Engineer argues that the roundabout should have decreased accidents.

* If he is correct then the sample mean `\(\bar{x}\)` should be less than the population mean `\(\mu_0\)`, i.e., `\(\bar{x} < \mu_0\)`
* If he is wrong then the sample mean `\(\bar{x}\)` should be at least as large as the population mean `\(\mu_0\)`, i.e., `\(\bar{x} \geq \mu_0\)`

---

### The Sampling Distribution of `\(\bar{x}\)`

We know from the theory of sampling distributions that the distribution of sample means, for all samples of size `\(n\)`, will be normally distributed (as shown below)

Most samples would be in the middle of the distribution but `by sheer chance` we could end up with a sample mean in the tails. This will happen with a very small probability but `it could happen!!`

<img src="images/hyp.png" width="100%" style="display: block; margin: auto;" />

---

### Scenario 1: Engineer says `\(\bar{x} < \mu\)`

If we believe the City Engineer, we would set up the hypotheses as follows:

- `\(H_0\)`: The roundabout does not reduce accidents, i.e., `\(\mu \geq \mu_0\)`
- `\(H_1\)`: The roundabout does reduce accidents, i.e., `\(\mu < \mu_0\)`
- Next set `\(\alpha\)` (the probability of making a Type I error)
- We then calculate the sample mean `\((\bar{x})\)` and the sample standard deviation `\((s)\)`
- Next we calculate the standard error of the sample mean: `\(s_{\bar{x}} = \dfrac{s}{\sqrt{n}}\)`
- Now we calculate `\(t = \dfrac{\bar{x} - \mu_0}{s_{\bar{x}}}\)` and this is what we call `\(t_{calculated}\)`
- Using `\(df=n-1\)`, find the area to the left of `\(t_{calculated}\)`
- If this area is very small we can conclude that the roundabout must have worked to reduce accidents
- How should we define `very small`? By setting `\(\alpha\)` either to 0.05 or to 0.01

We Reject `\(H_0\)` if `\(P(t_{calculated}) \leq \alpha\)`; the data provide sufficient evidence to conclude that the roundabout has reduced accidents

If `\(P(t_{calculated}) > \alpha\)` then we Fail to reject `\(H_0\)`; the data provide insufficient evidence to conclude that the roundabout has reduced accidents
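---

### Critical values in R (a sketch)

The cutoff values used on the slides that follow come from the `\(t\)` distribution with `\(df = n - 1 = 99\)`. A minimal sketch of how they can be computed, assuming `\(n = 100\)`:

```r
# one-tailed (lower) critical value at alpha = 0.05, df = 99
qt(0.05, df = 99)               # approximately -1.66

# two-tailed critical values at alpha = 0.05, df = 99
qt(c(0.025, 0.975), df = 99)    # approximately -1.98 and 1.98
```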
---

### Rejection rule

Reject `\(H_0\)` if calculated `\(t\)` falls in the green region (i.e., calculated `\(t \leq -1.6603\)`)

<img src="images/hyp3.png" width="100%" style="display: block; margin: auto;" />

---

### Failure to reject rule

Do Not Reject `\(H_0\)` if calculated `\(t\)` falls in the grey region (i.e., `\(t_{calculated} > -1.6603\)`)

<img src="images/hyp4.png" width="100%" style="display: block; margin: auto;" />

---

### Scenario 2: Engineer says `\(\bar{x} \neq \mu\)`

If we believe the City Engineer, we would set up the hypotheses as follows:

- `\(H_0\)`: The roundabout has no impact on accidents, i.e., `\(\mu = \mu_0\)`
- `\(H_1\)`: The roundabout has an impact on accidents, i.e., `\(\mu \neq \mu_0\)`
- We then calculate the sample mean `\((\bar{x})\)` and the sample standard deviation `\((s)\)`
- Next we calculate the standard error of the sample mean: `\(s_{\bar{x}} = \dfrac{s}{\sqrt{n}}\)`
- Now calculate `\(t_{calculated} = \dfrac{\bar{x} - \mu_0}{s_{\bar{x}}}\)`
- Using `\(df=n-1\)`, find the area to the left/right of `\(\pm t_{calculated}\)`

If this area is very small then we can conclude that the roundabout must have had an impact on accidents

How should we define `very small`? By setting `\(\alpha\)` either to 0.05 or to 0.01

We can then Reject `\(H_0\)` if `\(P(\pm t_{calculated}) \leq \alpha\)`; the data provide sufficient evidence to conclude that the roundabout has had an impact on accidents

If `\(P(\pm t_{calculated}) > \alpha\)` then we will Fail to reject `\(H_0\)`; the data provide insufficient evidence to conclude that the roundabout has had an impact on accidents

---

### Rejection rule

Reject `\(H_0\)` if calculated `\(\lvert{t}\rvert\)` falls in the green region (i.e., calculated `\(t \leq -1.98\)` or calculated `\(t \geq 1.98\)`)

<img src="images/hyp5.png" width="100%" style="display: block; margin: auto;" />

---

### Failure to reject rule

Do Not Reject `\(H_0\)` if calculated `\(\lvert{t}\rvert\)` falls in the grey region (i.e., `\(-1.98 < \text{calculated } t < 1.98\)`)

<img src="images/hyp6.png" width="100%" style="display: block; margin: auto;" />

---

### The process revisited ...

1. State the hypotheses
    - If we want to test whether something `has changed`, or `is different`, or `had an impact`, etc. then `\(H_0\)` must specify that nothing has changed `\(\ldots H_0:\mu = \mu_{0}; H_1: \mu \neq \mu_{0} \ldots\)` **two-tailed**
    - If we want to test whether something `has increased`, or `has risen`, or `is more` then `\(H_0\)` must specify that it has not increased `\(\ldots H_0:\mu \leq \mu_{0}; H_1: \mu > \mu_{0}\ldots\)` **one-tailed**
    - If we want to test whether something `has decreased`, or `has reduced`, or `is less` then `\(H_0\)` must specify that it has not decreased `\(\ldots H_0:\mu \geq \mu_{0}; H_1: \mu < \mu_{0}\ldots\)` **one-tailed**
2. Collect the sample and set `\(\alpha=0.05\)` or `\(\alpha=0.01\)`
3. Calculate `\(\bar{x}\)`, `\(s_{\bar{x}} = \dfrac{s}{\sqrt{n}}\)`, and `\(df=n-1\)`
4. Calculate the `\(t\)` statistic
5. Reject `\(H_0\)` if calculated `\(t\)` falls in the **critical region**; do not reject `\(H_0\)` otherwise (a worked sketch follows on the next slide)
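---

### The process in R (a sketch)

A minimal sketch of Scenario 1 using simulated data — the daily accident counts here are hypothetical, generated only to illustrate the mechanics:

```r
set.seed(682)                          # hypothetical seed, for reproducibility
accidents <- rpois(100, lambda = 5)    # hypothetical sample of 100 daily counts

# one-tailed test of H0: mu >= 6 against H1: mu < 6
t.test(accidents, mu = 6, alternative = "less")
```

If the reported `\(p\)`-value is `\(\leq \alpha\)` we Reject `\(H_0\)`; otherwise we Do Not Reject `\(H_0\)`.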
---

### But of course ...

beware Type I and Type II errors

<img src="images/type12.png" width="50%" style="display: block; margin: auto;" />

* `Type I Error:` You rejected `\(H_0\)` but it should not have been rejected `\((\text{level of significance} = \alpha)\)`
* `Type II Error:` You failed to reject `\(H_0\)` but it should have been rejected

---

class: inverse, middle

# .salt[.fancy[Continuous y variable]]

---

### Example 1 (One-sample t-test) [From Harrell & Slaughter]

We want to test if the mean tumor volume is 190 `\(mm^{3}\)` in a population with melanoma

`\(H_0: \mu = 190\)` versus `\(H_1: \mu \neq 190\)`

`\(\bar{x} = 181.52, s = 40, n = 100, \mu_0 = 190\)`

`\(s_{\bar{x}} = \dfrac{s}{\sqrt{n}} = \dfrac{40}{\sqrt{100}} = 4\)`

`\(t = \dfrac{\bar{x} - \mu_{0}}{s_{\bar{x}}} = \dfrac{181.52 - 190}{4} = -2.12\)`

`\(p-value = 0.037\)`, leading us to reject `\(H_0\)` if `\(\alpha = 0.05\)`

The data do not conform to the pattern predicted by the null hypothesis.

---

### Example 2 (Paired t-test) [From Harrell & Slaughter]

To investigate the relationship between smoking and bone mineral density, Rosner presented a paired analysis in which each person had a nearly perfect control: his or her twin. The difference computed was the density in the heavier-smoking twin minus the density in the lighter-smoking twin, normalized by dividing by the mean density of the twin pair.

The mean difference was -5% and the standard error was 2%, with `\(n = 41\)`

`\(H_0: \text{mean difference is } \mu_0 = 0\)` versus `\(H_1: \text{mean difference is } \mu_0 \neq 0\)`

`\(t = \dfrac{-5 - 0}{2} = -2.5\)`

`\(p-value = 0.0166\)`

The data do not conform to the pattern predicted by the null hypothesis.
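---

### Examples 1 and 2 in R (a sketch)

Both `\(p\)`-values can be recovered from the summary statistics alone; a minimal sketch:

```r
# Example 1: one-sample t-test from summary statistics
t1 <- (181.52 - 190) / (40 / sqrt(100))    # t = -2.12
2 * pt(t1, df = 100 - 1)                   # two-sided p-value, approx 0.037

# Example 2: paired t-test on the normalized twin differences
t2 <- (-5 - 0) / 2                         # t = -2.5
2 * pt(t2, df = 41 - 1)                    # two-sided p-value, approx 0.0166
```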
---

### Comparisons of Means from Common Parent Populations

<img src="stats101_2019_files/figure-html/unnamed-chunk-8-1.png" width="55%" style="display: block; margin: auto;" />

---

### Comparisons of Means from Different Parent Populations

<div class="figure" style="text-align: center">
<img src="stats101_2019_files/figure-html/unnamed-chunk-9-1.png" alt="Separate Parent Populations" width="45%" />
<p class="caption">Separate Parent Populations</p>
</div>

---

We often need to compare sample means across two groups. For example, are average earnings the same for men and women in a specific occupation? Perhaps we suspect (a) women are underpaid, or (more generally) that (b) their salaries differ from those of men.

Let the population and sample means be `\(\mu_m, \mu_w\)` and `\(\bar{x}_m, \bar{x}_w\)`, respectively

(a) `\(H_0: \mu_m \leq \mu_w\)` and `\(H_1: \mu_m > \mu_w\)`, `\(\therefore H_0: \mu_m - \mu_w \leq 0\)` and `\(H_1: \mu_m - \mu_w > 0\)`

(b) `\(H_0: \mu_m = \mu_w\)` and `\(H_1: \mu_m \neq \mu_w\)`, `\(\therefore H_0: \mu_m - \mu_w = 0\)` and `\(H_1: \mu_m - \mu_w \neq 0\)`

Standard Error of the difference in means: `\(s_{\bar{x}_m - \bar{x}_w} = \sqrt{\dfrac{s^{2}_{m}}{n_m} + \dfrac{s^{2}_w}{n_w}}\)`

Confidence Interval estimate: `\(\left(\bar{x}_m - \bar{x}_w\right) \pm t_{\alpha/2}\left(s_{\bar{x}_m - \bar{x}_w}\right)\)`

The Test Statistic: `\(t = \dfrac{\left(\bar{x}_m - \bar{x}_w \right) - \left(\mu_m - \mu_w \right)}{\sqrt{\dfrac{s^{2}_{m}}{n_m} + \dfrac{s^{2}_w}{n_w}}} = \dfrac{\left(\bar{x}_m - \bar{x}_w \right) - D_0}{\sqrt{\dfrac{s^{2}_{m}}{n_m} + \dfrac{s^{2}_w}{n_w}}}\)`

---

The degrees of freedom for this test:

`\(df = \dfrac{\left(\dfrac{s^{2}_m}{n_m} + \dfrac{s^{2}_w}{n_w} \right)^2}{\dfrac{1}{(n_m -1)}\left(\dfrac{s^{2}_m}{n_m}\right)^2 + \dfrac{1}{(n_w -1)}\left(\dfrac{s^{2}_w}{n_w}\right)^2}\)`

Note: We usually `round down` the `\(df\)` to the nearest integer

We have two ways of calculating the estimated standard error `\((s_{\bar{x}_1 - \bar{x}_2})\)` and the degrees of freedom `\(df\)`

(1) When the population variances are `assumed unequal`

(2) When the population variances are `assumed equal`

---

### (1) Unequal Population Variances

`Standard Error will be:` `\(\left(s_{\bar{x}_1 - \bar{x}_2}\right) = \sqrt{\dfrac{s^{2}_{1}}{n_1} + \dfrac{s^{2}_2}{n_2}}\)`

`Degrees of Freedom will be:` `\(df = \dfrac{\left(\dfrac{s^{2}_1}{n_1} + \dfrac{s^{2}_2}{n_2} \right)^2}{\dfrac{1}{(n_1 -1)}\left(\dfrac{s^{2}_1}{n_1}\right)^2 + \dfrac{1}{(n_2 -1)}\left(\dfrac{s^{2}_2}{n_2}\right)^2}\)`

- Use this when `\(n_1\)` or `\(n_2\)` is `\(< 30\)`, and
- Either sample has a standard deviation at least twice that of the other sample

---

### (2) Equal Population Variances

`Standard Error will be:` `\(\left(s_{\bar{x}_1 - \bar{x}_2}\right) = \sqrt{\dfrac{n_1 + n_2}{n_1 \times n_2}}\sqrt{\dfrac{(n_1 -1)s^{2}_{x_{1}} + (n_2 -1)s^{2}_{x_2}}{\left(n_1 + n_2\right) -2}}\)`

`Degrees of Freedom will be:` `\(df = \left( n_1 + n_2\right) -2\)`

- Use this when the standard deviations are roughly equal, and
- `\(n_1\)` and `\(n_2\)` are `\(\geq 30\)`

---

### Example 3 (Two-Sample t-test) [From Harrell & Slaughter]

Two soporific drugs are to be tested, Drug 1 versus Drug 2. Which of these is more effective?

`\(H_0: \mu_1 = \mu_2\)` versus `\(H_1: \mu_1 \neq \mu_2\)`

<img src="stats101_2019_files/figure-html/unnamed-chunk-10-1.png" width="50%" style="display: block; margin: auto;" />

Assuming unequal variances: `\(t = -1.8608, df = 17.776, p-value = 0.07939\)` so fail to reject `\(H_0\)`

Assuming equal variances: `\(t = -1.8608, df = 18, p-value = 0.07919\)` so fail to reject `\(H_0\)`
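---

### Example 3 in R (a sketch)

These numbers appear to match R's built-in `sleep` data (extra hours of sleep under two soporific drugs); a minimal sketch, assuming that is the source:

```r
# Welch test, variances assumed unequal: t = -1.8608, df = 17.776
t.test(extra ~ group, data = sleep)

# pooled test, variances assumed equal: t = -1.8608, df = 18
t.test(extra ~ group, data = sleep, var.equal = TRUE)
```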
---

class: inverse, middle

### .fat[.fancy[When do these tests break down?]]

### (1) Biased samples
### (2) Non-normally distributed data (can test)
### (3) Insufficient power (calculate a priori)
### (4) Unequal variances (can test)

---

# .salt[.fancy[Binomial and Chi-Square Distributions]]

---

### The Binomial

Given a certain number of independent trials `\((n)\)` with an identical probability `\((p)\)` of success `\((X)\)` in each trial, we can easily calculate the probability of seeing a specific number of successes.

For example, if I flip a coin 10 times, where `\(X=Head\)` with `\(p=0.5\)`, then `what is the probability of seeing exactly 2 heads, exactly 4 heads, 7 heads, etc.?` The answer is easily calculated as:

.pull-left[
`$$P\left[X \text{ successes}\right] = \binom{n}{X}p^{X}\left(1 - p\right)^{n-X} \\ \text{where } \binom{n}{X} = \dfrac{n!}{X!(n-X)!} \text{ and } \\ n! = n \times (n-1) \times (n-2) \times \cdots \times 2 \times 1$$`
]

.pull-right[
<img src="stats101_2019_files/figure-html/2-1.png" width="100%" style="display: block; margin: auto;" />
]
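---

### The coin-flip probabilities in R (a sketch)

The binomial formula on the previous slide is exactly what `dbinom()` computes; a minimal sketch for `\(n = 10\)` flips with `\(p = 0.5\)`:

```r
dbinom(2, size = 10, prob = 0.5)   # P(exactly 2 heads), approx 0.044
dbinom(4, size = 10, prob = 0.5)   # P(exactly 4 heads), approx 0.205
dbinom(7, size = 10, prob = 0.5)   # P(exactly 7 heads), approx 0.117
```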
---

## Radiologists and Missing Sons

Assume that radiologists are just as likely to have boys as girls. This then generates the following hypotheses:

`\(H_0: \text{ Radiologists are just as likely to have sons as daughters } (p=0.5)\)`

`\(H_1: \text{ Radiologists are not as likely to have sons as daughters } (p \neq 0.5)\)`

Let `\(\alpha=0.05\)`

The `\(p-value = 0.005014\)` so we can reject `\(H_0\)`; the data provide sufficient evidence to conclude that radiologists are not as likely to have sons as daughters.

What if we suspected, _a priori_, that radiologists are less likely to have sons? In that case we would have done the following:

`\(H_0: \text{ Radiologists are at least as likely to have sons as daughters } (p \geq 0.5)\)`

`\(H_A: \text{ Radiologists are less likely to have sons than daughters } (p < 0.5)\)`

Again, the `\(p-value=0.002507\)` and we can easily reject `\(H_0\)`; the data provide sufficient evidence to conclude that radiologists are not at least as likely to have sons as daughters.

---

## The `\(\chi^2\)`

The `\(\chi^2\)` distribution is used with multinomial data (i.e., when the categorical variable has more than two categories) to test whether observed frequency counts differ from expected frequency counts.

### The Goodness-of-Fit Test for a single variable

`\(H_0\)`: Proportions are all the same

`\(H_A\)`: Proportions are _not_ all the same

`$$\chi^{2} = \displaystyle\sum\limits_{i}\dfrac{\left(Observed_i - Expected_i \right)^2}{Expected_i}$$`

`\(\chi^2\)` is distributed with `\(\left( \text{no. of categories}-1 \right)\)` degrees of freedom `\((df)\)`

Reject `\(H_0\)` if `\(p-value \leq \alpha\)`; Do not reject `\(H_0\)` otherwise

As `\(df \to \infty\)` you need a larger `\(\chi^2\)` to Reject `\(H_0\)` at the same `\(\alpha\)`

---

The plot below shows how the theoretical `\(\chi^2\)` distribution varies with the degrees of freedom; as the distribution shifts right, the degrees of freedom are increasing. The first is for 3 degrees of freedom (which means we have a total of 4 categories), then we have 4 degrees of freedom (so 5 categories), then 5 degrees of freedom (so 6 categories), and finally 6 degrees of freedom (i.e., 7 categories). Note what happens: the more the degrees of freedom, the larger the `\(\chi^2\)` value needed to reject `\(H_0\)` with `\(\alpha=0.05\)` or `\(\alpha=0.01\)`.

.pull-left[
<img src="stats101_2019_files/figure-html/unnamed-chunk-12-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
The test is built on two assumptions:

1. No category has expected frequencies `\(< 1\)`
2. No more than 20% of the categories should have expected frequencies `\(< 5\)`
]

---

### An Example

Assume that there are four gourmet meats placed before 100 subjects. In a blind taste test each subject is asked to pick the item they liked the most. Do subjects exhibit indifference between the four items?

If they do, then we would expect about 25% to pick item A, 25% to pick item B, 25% to pick item C, and 25% to pick item D. If `\(H_0\)` were true then expected frequencies would be as follows for A, B, C, and D:

```
## [1] 25 25 25 25
```

Now assume that the 100 subjects actually indicated the following preferences for A, B, C, and D:

```
## [1] 30 10 40 20
```

Some of the difference between the <code>observed frequencies</code> and the expected frequencies __if `\(H_0\)` were true (i.e., there were no clear preferences)__ could arise by chance. Therefore we test whether the overall difference is large enough that it would rarely happen by chance. In other words, do the data suggest that subjects _do prefer some items over others_?

We will set `\(\alpha = 0.05\)` and then conduct the test.

---

### Carrying out the test

<table>
<thead>
<tr>
<th style="text-align:right;"> observed </th>
<th style="text-align:right;"> expected </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;"> 30 </td>
<td style="text-align:right;"> 25 </td>
</tr>
<tr>
<td style="text-align:right;"> 10 </td>
<td style="text-align:right;"> 25 </td>
</tr>
<tr>
<td style="text-align:right;"> 40 </td>
<td style="text-align:right;"> 25 </td>
</tr>
<tr>
<td style="text-align:right;"> 20 </td>
<td style="text-align:right;"> 25 </td>
</tr>
</tbody>
</table>

```
## 
##  Chi-squared test for given probabilities
## 
## data:  observed
## X-squared = 20, df = 3, p-value = 0.0001697
```

Given that the `\(p-value\)` is less than `\(\alpha=0.05\)` we can reject the `\(H_0\)` that the items are equally preferred. That is, the data suggest that some items are preferable to others.

---

### `\(\chi^2\)` test of Association/Independence

Often you are interested in testing for a relationship between two categorical variables

* is smoking associated with lung cancer?
* is a woman's educational attainment associated with fewer pregnancies?
* does diabetes differ by race/ethnicity?

Calculate, for each cell in the contingency table, `\(\dfrac{(f_{ij}-e_{ij})^{2}}{e_{ij}}\)`

Add the resulting value over all cells. This yields

`$$\chi^{2} = \sum_{i} \sum_{j} \dfrac{(f_{ij} - e_{ij})^{2}}{e_{ij}}$$`

`\(\chi^{2} \sim df=(r-1)(c-1)\)` where ... `\(r=\)` `number of rows`, and `\(c=\)` `number of columns`

If you have a `\(2 \times 2\)` table or small samples then Fisher's Exact test may be preferable

---

### Pigeons and Predation

It has been hypothesized that the white rump of pigeons serves to distract predators like the peregrine falcon, and therefore it may be an adaptation to reduce predation. To test this, researchers followed the fate of 203 pigeons, 101 with white rumps and 102 with blue rumps. Nine of the white-rumped birds and 92 of the blue-rumped birds were killed by falcons.

```
## 
##            blue white Sum
## killed       92     9 101
## survived     10    92 102
## Sum         102   101 203
```

```
## 
##                blue      white
## killed   0.91089109 0.08910891
## survived 0.09803922 0.90196078
```

The second table suggests that 91% of the pigeons killed were blue-rumped versus only 9% white-rumped.
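---

### Running the tests in R (a sketch)

A minimal sketch of how the results on the next slide could be produced, assuming `tab.p` is the `\(2 \times 2\)` table of counts (the object name matches the output shown; the `X-squared = 134.13` value suggests the continuity correction was turned off):

```r
tab.p <- matrix(c(92, 10, 9, 92), nrow = 2,
                dimnames = list(c("killed", "survived"),
                                c("blue", "white")))

chisq.test(tab.p, correct = FALSE)   # Pearson's chi-squared test
fisher.test(tab.p)                   # Fisher's exact test, fine for 2 x 2 tables
```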
---

> Do the two kinds of pigeons differ in their rate of capture by falcons? Carry out an appropriate test.

`\(H_0:\)` Predation by falcons is independent of rump-color

`\(H_A:\)` Predation by falcons is not independent of rump-color

`\(\alpha=0.05\)`

```
## 
##  Pearson's Chi-squared test
## 
## data:  tab.p
## X-squared = 134.13, df = 1, p-value < 2.2e-16
```

```
## 
##  Fisher's Exact Test for Count Data
## 
## data:  tab.p
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   33.70502 270.81972
## sample estimates:
## odds ratio 
##   89.50586
```

Regardless of the test used, we can safely Reject `\(H_0\)` given that the `\(p-value \approx 0\)`. The data suggest that predation by falcons is not independent of rump-color

---

## Coffee and Sex at Birth

| Sex at Birth | Light | Regular | Dark | Total |
| :-- | :-- | :-- | :-- | --: |
| Male | 20 | 40 | 20 | 80 |
| Female | 30 | 30 | 10 | 70 |
| Total | 50 | 70 | 30 | 150 |

`Research Question:` Are coffee preferences independent of sex at birth (i.e., is there any association between coffee preference and sex at birth)?

`\(H_{0}:\)` Coffee preference is independent of sex at birth

`\(H_{1}:\)` Coffee preference is not independent of sex at birth

---

`For each cell` in the contingency table, calculate

`$$e_{ij} = \dfrac{\text{Row } i \text{ Total} \times \text{Column } j \text{ Total}}{\text{Sample Size}}$$`

.pull-left[
`\(e_{11}=\dfrac{(80)(50)}{150}=\dfrac{4000}{150}=26.67\)`

`\(e_{12}=\dfrac{(80)(70)}{150}=\dfrac{5600}{150}=37.33\)`

`\(e_{13}=\dfrac{(80)(30)}{150}=\dfrac{2400}{150}=16.00\)`
]

.pull-right[
`\(e_{21}=\dfrac{(70)(50)}{150}=\dfrac{3500}{150}=23.33\)`

`\(e_{22}=\dfrac{(70)(70)}{150}=\dfrac{4900}{150}=32.67\)`

`\(e_{23}=\dfrac{(70)(30)}{150}=\dfrac{2100}{150}=14.00\)`
]

---

| Sex at birth | Coffee | `\(f_{i}\)` | `\(e_{i}\)` | `\((f_{i}-e_{i})\)` | `\((f_{i}-e_{i})^{2}\)` | `\((f_{i}-e_{i})^{2}/{e_{i}}\)`|
| --: | --: | --: | --: | --: | --: | --: |
|Male | Light | 20 | 26.67 | -6.67 | 44.49 | 1.67|
|Male | Regular | 40 | 37.33 | 2.67 | 7.13 | 0.19|
|Male | Dark | 20 | 16.00 | 4.00 | 16.00 | 1.00|
|Female | Light | 30 | 23.33 | 6.67 | 44.49 | 1.91|
|Female | Regular | 30 | 32.67 | -2.67 | 7.13 | 0.22|
|Female | Dark | 10 | 14.00 | -4.00 | 16.00 | 1.14|
| `\(\chi^{2}\)` | | | | | | 6.13|

`\(df=(r-1)(c-1)=(2-1)(3-1)=(1)(2)=2\)`

`\(p-value < 0.05\)`; Reject `\(H_{0}\)`

Coffee preference and sex at birth are not independent

---

class: inverse, middle, center

# .salt[.fancy[Linear Regression]]

---

## Understanding Infant Mortality (2014)

<img src="stats101_2019_files/figure-html/unnamed-chunk-18-1.png" width="65%" style="display: block; margin: auto;" />

---

### Can we predict infant mortality from income?

`$$y = \alpha + \beta(x)$$`

`$$\text{Infant Mortality} = \alpha + \beta(Income)$$`

<img src="stats101_2019_files/figure-html/unnamed-chunk-19-1.png" width="55%" style="display: block; margin: auto;" />

---

`$$\text{Infant Mortality} = 53.23 - 0.0016(Income)$$`

As Income increases by 1 US Dollar, Infant Mortality drops by 0.0016

If Income rises by 1000 US Dollars, Infant Mortality drops by `\(0.0016(1000) = 1.6\)`

What if a country has Income of 20,000? What would be the predicted Infant Mortality?

`$$\text{Infant Mortality} = 53.23 - 0.0016(Income) \\ = 53.23 - 0.0016(20000) \\ = 53.23 - 32 \\ = 21.23$$`

How good is this linear regression?

(1) Income is a significant predictor ... `\(p-value = 0.000000107\)`

(2) Adjusted `\(R^2 = 0.1884\)` ... This linear regression model `explains` about 18.84% of the variation in infant mortality

(3) Root Mean Squared Error `\(= 35.09\)` ... If we use this regression model to predict infant mortality, the `average prediction error` would be `\(\pm 35.09\)` infant deaths

A high Adjusted `\(R^2\)`, a small Root Mean Squared Error, and a statistically significant predictor variable are desirable; a sketch of this fit in R follows
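---

### Fitting the line in R (a sketch)

A minimal sketch of how this regression could be fit and used for prediction, assuming the `sowc` data frame with the `U5MR` and `Income` columns used later in the deck:

```r
fit1 <- lm(U5MR ~ Income, data = sowc)
summary(fit1)    # slope approx -0.0016, adjusted R-squared approx 0.19

# predicted infant mortality for a country with Income of 20,000
predict(fit1, newdata = data.frame(Income = 20000))    # approx 21.2
```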
---

### Improving the Linear Regression Model

`$$y = \alpha + \beta_1(x_1) + \beta_2(x_2) + \ldots + \beta_k(x_k)$$`

`$$y = \alpha + \beta_1(Income) + \beta_2(\text{Female Youth Literacy Rate}) + \beta_3(\text{% in Urban Areas})$$`

```
## 
## Call:
## lm(formula = U5MR ~ Income + FemaleYouthLR + Urban, data = sowc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.570 -13.036  -4.449   4.948  94.270 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.810e+02  1.011e+01  17.901   <2e-16 ***
## Income        -3.374e-04  2.411e-04  -1.399   0.1642    
## FemaleYouthLR -1.418e+00  1.236e-01 -11.467   <2e-16 ***
## Urban         -2.686e-01  1.213e-01  -2.214   0.0286 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.22 on 129 degrees of freedom
## Multiple R-squared:  0.6525, Adjusted R-squared:  0.6444 
## F-statistic: 80.74 on 3 and 129 DF,  p-value: < 2.2e-16
```

---

.pull-left[
<img src="stats101_2019_files/figure-html/unnamed-chunk-21-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="stats101_2019_files/figure-html/unnamed-chunk-22-1.png" width="100%" style="display: block; margin: auto;" />
]

* Holding all else constant, as Female Youth Literacy Rate rises by a unit amount infant mortality drops by 1.418
* Holding all else constant, as the % of the population living in Urban areas rises by a unit amount infant mortality drops by 0.2686
* Income does not appear to have a statistically significant impact on infant mortality
* The model "explains" 64.44% of the variation in infant mortality
* Average prediction error would be 23.22

---

### Potential Issues

* `Biased sample` (analysis will be worthless)
* `Incorrect functional form` `\(\left(y = \alpha + \beta(\frac{1}{x^3}) \text{ but you fit } y = \alpha + \beta(x) \right)\)`
* `Influential outliers` (an extremely unusual data point may influence the linear regression line)
* `Heteroscedastic errors` (discretionary spending of poor families will have little variance while that of wealthy families will have more variance)
* `Correlated errors` (people in the same poor neighborhood are more likely to share adverse health outcomes)
* `Measurement error` (measurement error in the independent variables)
* `High Multicollinearity` (income and discretionary spending tend to be highly correlated so you cannot hold one constant while increasing the other by a unit amount)
* `Non-normal distribution of errors` (you forgot to include some independent variables, or have the wrong functional form, or both)

Several of these can be checked with standard diagnostics, as sketched on the next slide
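---

### Checking for trouble in R (a sketch)

A minimal sketch of standard diagnostics for the model above (`vif()` is from the add-on `car` package):

```r
fit2 <- lm(U5MR ~ Income + FemaleYouthLR + Urban, data = sowc)

plot(fit2)        # residual plots: look for non-linearity, heteroscedastic
                  # errors, non-normal errors, and influential outliers
car::vif(fit2)    # variance inflation factors flag high multicollinearity
```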
---

class: inverse, middle

# .salt[.fancy[Binary y variable]]

### * Passenger survives `\((y = 1)\)` or dies `\((y = 0)\)`
### * Drug has an impact `\((y = 1)\)` or it does not `\((y = 0)\)`
### * Birth is of a girl `\((y = 1)\)` or of a boy `\((y = 0)\)`
### * Tumor has shrunk `\((y = 1)\)` or has not shrunk `\((y = 0)\)`

---

## Logit (Logistic) and Probit Models

Goal is to model the `probability` of survival, drug impact, birth of a girl child, tumor shrinking, and so on

.pull-left[
<img src="images/logitprobit_sim.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
* `\(Y_{i} \in \{0,1 \}, i=1,\ldots,N\)`
* `\(P(Y_{i}=1|X_{i}) = \Phi\left(\sum b_{k}X_{ik}\right)\)` ... Probit
* `\(P(Y_{i}=1|X_{i}) = \dfrac{\exp\left(\sum b_{k}X_{ik}\right)}{1 + \exp\left(\sum b_{k}X_{ik}\right)}\)` ... Logit
* `\(Y_{1}, Y_{2}, \ldots, Y_{N}\)` are statistically independent
* The `\(X_{ik}\)` are not exactly or nearly linearly dependent
]
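---

### Fitting the models in R (a sketch)

A minimal sketch of the two logit models whose output follows, assuming the Titanic-style data frame `mydf2` named in that output:

```r
fit_fare <- glm(Survived ~ Fare, data = mydf2,
                family = binomial(link = "logit"))

fit_fare_sex <- glm(Survived ~ Fare + Female, data = mydf2,
                    family = binomial(link = "logit"))

summary(fit_fare)       # Fare alone
summary(fit_fare_sex)   # Fare plus sex at birth
```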
---

.pull-left[
```
## 
## Call:
## glm(formula = Survived ~ Fare, family = binomial(link = "logit"), 
##     data = mydf2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4558  -0.8985  -0.8625   1.3461   1.5344  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.911664   0.095817  -9.515  < 2e-16 ***
## Fare         0.014741   0.002219   6.644 3.06e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1171.1  on 875  degrees of freedom
## Residual deviance: 1105.7  on 874  degrees of freedom
## AIC: 1109.7
## 
## Number of Fisher Scoring iterations: 4
```
]

.pull-right[
<img src="stats101_2019_files/figure-html/unnamed-chunk-26-1.png" width="100%" style="display: block; margin: auto;" />
]

---

.pull-left[
```
## 
## Call:
## glm(formula = Survived ~ Fare + Female, family = binomial(link = "logit"), 
##     data = mydf2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1991  -0.6277  -0.5872   0.8123   1.9237  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.652604   0.148535   4.394 1.11e-05 ***
## Fare         0.011033   0.002289   4.820 1.44e-06 ***
## FemaleMale  -2.408824   0.170896 -14.095  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1171.07  on 875  degrees of freedom
## Residual deviance:  876.04  on 873  degrees of freedom
## AIC: 882.04
## 
## Number of Fisher Scoring iterations: 4
```
]

.pull-right[
<img src="stats101_2019_files/figure-html/unnamed-chunk-28-1.png" width="100%" style="display: block; margin: auto;" />
]

---

### How good are the models?

.pull-left[
`$$Survived = f(Fare)$$`

```
## Bootstrapped (25 reps) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    0    1
##          0 56.5 29.9
##          1  4.1  9.5
##                             
##  Accuracy (average) : 0.6599
```
]

.pull-right[
`$$Survived = f(Fare, Female)$$`

```
## Bootstrapped (25 reps) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    0    1
##          0 51.4 11.7
##          1 10.2 26.8
##                             
##  Accuracy (average) : 0.7814
```
]

#### .fat[Story]

* Males had a lower probability of survival than females, on average, and holding all else constant
* Wealthier passengers had a higher probability of survival than other passengers, on average, and holding all else constant
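---

### Where those probabilities come from (a sketch)

The survival probabilities on the next slide can be approximated directly from the fitted coefficients; a minimal sketch (small differences from the reported numbers would reflect how the "average" passenger is computed):

```r
b <- coef(fit_fare_sex)              # intercept, Fare, FemaleMale

fare_q <- c(7.91, 14.4542, 31.00)    # Q1, median, and Q3 fares

plogis(b[1] + b[2] * fare_q)           # female, approx 0.68 to 0.73
plogis(b[1] + b[2] * fare_q + b[3])    # male, approx 0.16 to 0.20
```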
---

The reference individual is a woman who paid the Median Fare. Compared to this individual,

(a) Probability that the average Male who paid the same Median Fare of 14.4542 survived = 0.1660

(b) Probability that the average Female who paid the same Median Fare of 14.4542 survived = 0.6919

(c) Probability of survival was 0.6760 for the average Female who paid 7.91 `\((Q_1)\)` and 0.7300 if she paid 31.00 `\((Q_3)\)`

(d) Probability of survival was 0.1561 for the average Male who paid `\(Q_1\)` Fare and 0.1934 if he paid `\(Q_3\)` Fare

.pull-left[
| Fare | Male Probability | Female Probability |
| :- | :- | :- |
| 7.91 | 0.1561 | 0.6760 |
| 14.45 | 0.1660 | 0.6919 |
| 31.00 | 0.1934 | 0.7300 |
]

.pull-right[
<img src="stats101_2019_files/figure-html/unnamed-chunk-31-1.png" width="100%" style="display: block; margin: auto;" />
]

---

### The Chivalry of the Seas

<img src="stats101_2019_files/figure-html/unnamed-chunk-33-1.png" width="70%" style="display: block; margin: auto;" />

---

class: right, middle

## .left[.heat[.fancy[Questions??]]]

<img class="circle" src="https://github.com/aniruhil.png" width="175px"/>

# Find me at...

[<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"/></svg> @aruhil](http://twitter.com/aruhil)

[<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M326.612 185.391c59.747 59.809 58.927 155.698.36 214.59-.11.12-.24.25-.36.37l-67.2 67.2c-59.27 59.27-155.699 59.262-214.96 0-59.27-59.26-59.27-155.7 0-214.96l37.106-37.106c9.84-9.84 26.786-3.3 27.294 10.606.648 17.722 3.826 35.527 9.69 52.721 1.986 5.822.567 12.262-3.783 16.612l-13.087 13.087c-28.026 28.026-28.905 73.66-1.155 101.96 28.024 28.579 74.086 28.749 102.325.51l67.2-67.19c28.191-28.191 28.073-73.757 0-101.83-3.701-3.694-7.429-6.564-10.341-8.569a16.037 16.037 0 0 1-6.947-12.606c-.396-10.567 3.348-21.456 11.698-29.806l21.054-21.055c5.521-5.521 14.182-6.199 20.584-1.731a152.482 152.482 0 0 1 20.522 17.197zM467.547 44.449c-59.261-59.262-155.69-59.27-214.96 0l-67.2 67.2c-.12.12-.25.25-.36.37-58.566 58.892-59.387 154.781.36 214.59a152.454 152.454 0 0 0 20.521 17.196c6.402 4.468 15.064 3.789 20.584-1.731l21.054-21.055c8.35-8.35 12.094-19.239 11.698-29.806a16.037 16.037 0 0 0-6.947-12.606c-2.912-2.005-6.64-4.875-10.341-8.569-28.073-28.073-28.191-73.639 0-101.83l67.2-67.19c28.239-28.239 74.3-28.069 102.325.51 27.75 28.3 26.872 73.934-1.155 101.96l-13.087 13.087c-4.35 4.35-5.769 10.79-3.783 16.612 5.864 17.194 9.042 34.999 9.69 52.721.509 13.906 17.454 20.446 27.294 10.606l37.106-37.106c59.271-59.259 59.271-155.699.001-214.959z"/></svg> aniruhil.org](https://aniruhil.org)

[<svg style="height:0.8em;top:.04em;position:relative;" viewBox="0 0 512 512"><path d="M476 3.2L12.5 270.6c-18.1 10.4-15.8 35.6 2.2 43.2L121 358.4l287.3-253.2c5.5-4.9 13.3
2.6 8.6 8.3L176 407v80.5c0 23.6 28.5 32.9 42.5 15.8L282 426l124.6 52.2c14.2 6 30.4-2.9 33-18.2l72-432C515 7.8 493.3-6.8 476 3.2z"/></svg> ruhil@ohio.edu](mailto:ruhil@ohio.edu)