Critical values of t (two-tailed test)

| df | α = .10 | α = .05 | α = .01 |
|---|---|---|---|
| 1 | 6.31 | 12.71 | 63.66 | 
| 2 | 2.92 | 4.30 | 9.92 | 
| 3 | 2.35 | 3.18 | 5.84 | 
| 4 | 2.13 | 2.78 | 4.60 | 
| 5 | 2.02 | 2.57 | 4.03 | 
| 6 | 1.94 | 2.45 | 3.71 | 
| 7 | 1.89 | 2.36 | 3.50 | 
| 8 | 1.86 | 2.31 | 3.36 | 
| 9 | 1.83 | 2.26 | 3.25 | 
| 10 | 1.81 | 2.23 | 3.17 | 
| 11 | 1.80 | 2.20 | 3.11 | 
| 12 | 1.78 | 2.18 | 3.05 | 
| 13 | 1.77 | 2.16 | 3.01 | 
| 14 | 1.76 | 2.14 | 2.98 | 
| 15 | 1.75 | 2.13 | 2.95 | 
| 16 | 1.75 | 2.12 | 2.92 | 
| 17 | 1.74 | 2.11 | 2.90 | 
| 18 | 1.73 | 2.10 | 2.88 | 
| 19 | 1.73 | 2.09 | 2.86 | 
| 20 | 1.72 | 2.09 | 2.85 | 
| 21 | 1.72 | 2.08 | 2.83 | 
| 22 | 1.72 | 2.07 | 2.82 | 
| 23 | 1.71 | 2.07 | 2.81 | 
| 24 | 1.71 | 2.06 | 2.80 | 
| 25 | 1.71 | 2.06 | 2.79 | 
| 26 | 1.71 | 2.06 | 2.78 | 
| 27 | 1.70 | 2.05 | 2.77 | 
| 28 | 1.70 | 2.05 | 2.76 | 
| 29 | 1.70 | 2.05 | 2.76 | 
| 30 | 1.70 | 2.04 | 2.75 | 
| 35 | 1.69 | 2.03 | 2.72 | 
| 40 | 1.68 | 2.02 | 2.70 | 
| 45 | 1.68 | 2.01 | 2.69 | 
| 50 | 1.68 | 2.01 | 2.68 | 
| 55 | 1.67 | 2.00 | 2.67 | 
| 60 | 1.67 | 2.00 | 2.66 | 
| 65 | 1.67 | 2.00 | 2.65 | 
| 70 | 1.67 | 1.99 | 2.65 | 
| 75 | 1.67 | 1.99 | 2.64 | 
| 80 | 1.66 | 1.99 | 2.64 | 
| 85 | 1.66 | 1.99 | 2.63 | 
| 90 | 1.66 | 1.99 | 2.63 | 
| 95 | 1.66 | 1.99 | 2.63 | 
| 100 | 1.66 | 1.98 | 2.63 | 
| Infinity | 1.64 | 1.96 | 2.58 | 
First, complete the paper lab exam. Then do today’s lab! (Did you finish the lab on t-tests with dependent samples?)
Objectives
The objectives of today’s lab are to:
- Think through decision errors and how \(\alpha\) and \(\beta\) might be minimized
- Understand how you can make decisions that reduce error
- Learn about how error interacts with concepts we’re discussing in class like power, sample size, and effect sizes
There is again no answer sheet due. Don’t worry—we’ll have answers due for labs again starting next week.
Defining error
Your textbook describes a decision error as one where “the right procedures lead to the wrong decisions” (Aron, Coups, & Aron, 2013, p. 177). I like this definition, which does a nice job of summing up what we mean when we talk about errors. The authors also remind us that researchers cannot usually tell when they have made an error. If they could, they wouldn’t make the error.
As we discussed in class, Type I error, indicated by the lowercase Greek letter alpha (\(\alpha\)), represents the likelihood of a false positive. It is equivalent to the cut-off (significance) level we set for p, and we want to keep it as small as possible. One way this is often represented is in a t-table, where (as you can see above) the columns technically refer to the alpha level you have chosen. We normally choose to ask “is \(p<.05\)?” which is the same as choosing to look at the \(\alpha=.05\) column in the t-table.
We can also describe the probability of correctly retaining the null (if it is true) as \(1-\alpha\). So when we use \(\alpha=.05\) as our alpha level, we’re also giving ourselves a 95% chance of correctly retaining the null if it’s true.
Type II error (aka false negatives) represents the likelihood of failing to reject the null hypothesis when it is in fact false. We use the Greek lowercase letter beta (\(\beta\)) to represent Type II error.
One of our goals in doing statistical tests in psychology is to minimize the false positive rate (Type I error, \(\alpha\)) by setting an appropriate cut-off (e.g., testing to see if \(p<.05\) [\(\alpha=.05\)] where appropriate). We can reduce our \(\alpha\) by just using a smaller one. If you scroll up to the t-table, you’ll see what happens—our critical t-value gets larger. So at \(df=50\), for example, \(t_{crit}(50)=\pm2.01\) for \(\alpha=.05\) but \(t_{crit}(50)=\pm2.68\) for \(\alpha=.01\). That’s a higher hurdle to surpass—we are literally saying that we will only conclude that we have a statistically significant test if our results have a 1% or less chance of happening under the null. So we need a higher t-value to be able to make that conclusion. See the image below to remind yourself—the darker red shading is for the \(\frac{1}{2}\)% at either tail (-2.68 or +2.68), whereas the lighter red is for 2.5% in either tail (-2.01 and +2.01). (Reminder: we’re using \(\frac{1}{2}\)% in each tail because that’s a total of 1%, or 2.5% in each tail because that’s a total of 5%.)
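(A side note, not needed for the lab: if you’d like to verify the t-table’s critical values yourself, here’s a minimal sketch in Python using scipy. It assumes, as described above, that we’re doing a two-tailed test, so half of \(\alpha\) goes in each tail.)

```python
# A quick check of the t-table's two-tailed critical values (optional; not part of the lab).
from scipy import stats

df = 50
for alpha in (0.10, 0.05, 0.01):
    # For a two-tailed test, put half of alpha in each tail.
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    print(f"df = {df}, alpha = {alpha}: t_crit = ±{t_crit:.2f}")

# Expected output (matching the table above):
# df = 50, alpha = 0.1: t_crit = ±1.68
# df = 50, alpha = 0.05: t_crit = ±2.01
# df = 50, alpha = 0.01: t_crit = ±2.68
```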

When we set \(\alpha\) to .05, or to .01, we are trying to make sure that only 5% or 1% of “true null” hypotheses will be incorrectly rejected. We are trying to reduce our Type I error. But the error of failing to reject the null when we should in fact reject it (it’s really the case that there is an effect) is also a problem!
Thus, another goal of our hypothesis-testing procedure in psychology is to minimize the false negative rate (Type II error, or \(\beta\)) by maximizing the power of the test, represented by \(1-\beta\). That is, if there is an effect, we want to find that effect! But how likely we are to find a true effect depends on how big that effect is. What is power? We’ll discuss it more in class, but it can be defined as the probability that we will correctly reject the null hypothesis.
So while we talk about minimizing \(\alpha\), we talk about maximizing power (\(1-\beta\)). This does mean that we’re also trying to minimize \(\beta\); it’s just the other side of it. Navarro and Foxcroft have a description of why getting an exact number for \(\beta\) is difficult. (In short: \(\beta\) depends on how big the true effect actually is, and there are many possibilities for that.)
As they go on to explain, when there’s a very clear difference between what exists in the world and what the null hypothesis predicts, you will have very high power; but if the true state of the world differs from the null only slightly, then you’ll have very low power to detect that difference. (Your chance of a Type II error will be high!) We use the idea of effect size to try to quantify this distance.
Effect size
The effect size is a standardized measure of the difference between populations (usually the difference between their means). For example, an effect size might capture how much scores on a measure of depression symptoms change after an intervention. It’s not the same as a t-test, but as you’ll see below, it’s calculated in a similar way.
We only really care about the effect size when we find statistically significant results. If the results are not significant, then in principle there is no true difference between the population means, so we do not report an effect size. Here’s a table summarizing this, which we’ll also discuss in class:
| | Big effect size | Small effect size |
|---|---|---|
| Significant test results | Difference is real—and matters | Difference is real—but may not be that interesting | 
| Non-significant test results | No effect is observed | No effect is observed | 
That is to say: if your test results don’t show statistical significance, the effect size doesn’t matter because no effect was observed. Don’t report it or try to interpret it.
One common measure of effect size is called Cohen’s d; it’s named after Jacob Cohen, whom we’ve mentioned in class before. Cohen’s d is calculated by dividing the difference between the means by an estimate of the standard deviation. So it’s quite like a t-score, but with the standard deviation (rather than the standard error) in the denominator:
\[d=\frac{\textrm{mean 1}-\textrm{mean 2}}{SD}\]
Cohen’s d can be roughly interpreted as follows:
- Cohen’s \(d=0.20\): “small” effect
- Cohen’s \(d=0.50\): “medium” effect
- Cohen’s \(d=0.80\): “large” effect
These are rough estimates. In some situations, an effect of Cohen’s \(d=.30\) can be considered about as large as you would expect.
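If it helps to see the formula in action, here’s a minimal sketch in Python with made-up numbers (in the lab itself, Jamovi computes d for you). It uses the pooled standard deviation of two groups as the standardizer, which is one common choice; the paired-samples version you’ll see below uses the standard deviation of the difference scores instead.

```python
# A rough sketch of computing Cohen's d "by hand" (made-up numbers, for illustration only).
from statistics import mean, stdev
from math import sqrt

group1 = [14, 12, 15, 11, 13, 16, 12, 14]  # hypothetical scores, group 1
group2 = [10, 9, 12, 8, 11, 13, 9, 10]     # hypothetical scores, group 2

# Numerator of d: the difference between the means
mean_diff = mean(group1) - mean(group2)

# Denominator of d: an estimate of the standard deviation (here, pooled across the two groups)
n1, n2 = len(group1), len(group2)
pooled_sd = sqrt(((n1 - 1) * stdev(group1) ** 2 + (n2 - 1) * stdev(group2) ** 2) / (n1 + n2 - 2))

d = mean_diff / pooled_sd
print(round(d, 2))  # about 1.86 with these made-up numbers -- a "large" effect
```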
We can get this measure of effect size very easily. Try it out by opening, in Jamovi, the fisher dataset we worked with in the lab on dependent means. Re-run the paired samples t-test to compare the HRSD pre and post scores. Under Additional Statistics, check the box for Effect size, and also check Confidence interval to get one around the effect size. Check Descriptives as well (right below that).
What should I see?
First, you will see the results of the t-test we saw in last week’s lab. Because the results are statistically significant (\(p<.05\), which is our alpha-level for this test), we can then interpret the effect size.
Paired Samples T-Test

| | | | statistic | df | p |
|---|---|---|---|---|---|
| hrsd.pre | hrsd.post | Student's t | 12.6 | 31.0 | < .001 |
To the right of the test results, you will also see the Cohen’s d effect size measure, which I include here. The SD in the denominator of the equation here is the standard deviation of the difference scores.
Effect Size

| | Estimate | 95% CI Lower | 95% CI Upper |
|---|---|---|---|
| Cohen's d | 2.23 | 1.57 | 2.88 |
This is a very large effect size! The 95% confidence interval, here, is the range of scores likely to include the true effect size. One way to think about this: if this experiment was conducted 100 times, and each time we computed the 95% confidence interval around the effect size, 95% of them would contain the true effect size.
If the confidence interval around Cohen’s d includes 0, that suggests you likely don’t have a meaningful effect. Here, though, even the lower bound of our confidence interval is a large effect.
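(A side note for checking the arithmetic, not something you need to compute: for a paired t-test, Cohen’s d based on the difference scores is just the t statistic divided by the square root of the number of pairs. With \(df=31\) there were \(n=32\) pairs, so:)

\[d=\frac{\textrm{mean of differences}}{SD_{\textrm{differences}}}=\frac{t}{\sqrt{n}}=\frac{12.6}{\sqrt{32}}\approx 2.23\]

(The small difference you might get by hand comes from rounding in the reported t.)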
Reporting Cohen’s d
How do you report this? Just tack it on after the \(p<.05\):
Depression scores dropped from baseline (\(M=13.8\)) to post-treatment (\(M=5.78\)) on the HRSD, \(t(31)=12.6,p<.05\), Cohen’s \(d=2.23\), 95% CI [1.57, 2.88].
Those brackets are how 95% confidence intervals tend to be reported, too.
You might want to report that \(p<.001\), since that’s what Jamovi told you. In some circumstances, that’s appropriate. I recommend defaulting, however, to reporting the answer to the question you were asking.
Power
Remember that power is the probability that we will correctly reject the null hypothesis.
There are three factors that most affect statistical power:
- Effect size: It’s easier to find large effects than small effects
- Sample size: The larger the sample, the greater the power
- Your alpha (Type I error rate): Decreasing your Type I error rate by using a smaller \(\alpha\) will also decrease power
Before you continue, make sure you have working definitions of the following terms. You might want to chat with classmates about these.
- Type I error
- Type II error
- Effect size
- Power
- \(\alpha\)
- \(\beta\)
Go to this website, which is an interactive visualization that will let you manipulate power, alpha, sample size (n), and effect size (Cohen’s d). Think/talk through what the two different distributions on the website represent. What do the shaded colors (red, light blue and dark blue) indicate? Play around with some settings, and see what happens. (Don’t ignore the “reset zoom” button!)
Solve for power
Use the website to solve for power in the following situations. Copy each table to your notes and fill in the power column.
Effect size
| | Effect size (d) | Sample size | \(\alpha\) | Power (you fill in) |
|---|---|---|---|---|
| Case 1 | .20 | 30 | .05 | |
| Case 2 | .50 | 30 | .05 | |
| Case 3 | .80 | 30 | .05 | |
How does effect size affect power? Why is this the case?
Sample size
| | Effect size (d) | Sample size | \(\alpha\) | Power (you fill in) |
|---|---|---|---|---|
| Case 1 | .30 | 10 | .05 | |
| Case 2 | .30 | 20 | .05 | |
| Case 3 | .30 | 40 | .05 | |
How does sample size affect power? Why?
Alpha
| | Effect size (d) | Sample size | \(\alpha\) | Power (you fill in) |
|---|---|---|---|---|
| Case 1 | .30 | 30 | .01 | |
| Case 2 | .30 | 30 | .05 | |
| Case 3 | .30 | 30 | .10 | |
How does alpha level affect power? Why?
What sort of conclusions did you reach?
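If you’d like to double-check the numbers you read off the website, here’s a hedged sketch using Python’s statsmodels package. It assumes a one-sample/paired t-test, which may not be exactly the test the website models, so expect the values to be close but not necessarily identical.

```python
# A sketch for checking your "solve for power" answers (assumes a one-sample/paired t-test;
# the website may model things a little differently, so numbers may not match exactly).
from statsmodels.stats.power import TTestPower

analysis = TTestPower()

# The "Effect size" table: d varies, sample size = 30, alpha = .05, two-tailed
for d in (0.20, 0.50, 0.80):
    power = analysis.solve_power(effect_size=d, nobs=30, alpha=0.05, alternative='two-sided')
    print(f"d = {d}: power = {power:.2f}")

# For the other two tables, vary nobs= (sample size) or alpha= in the same way.
```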
Number of tails
What happens if you play around with one- vs. two-tailed tests? Note that we still almost always want to do a two-tailed test! All we’re really doing by switching from a two-tailed to a one-tailed test here is putting our whole alpha into a single tail, which effectively doubles the alpha level in the tail we care about.
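If you want to convince yourself of that claim, here’s a tiny sketch (same statsmodels caveats as above):

```python
# The "one-tailed test ~ doubled alpha" idea, in code (same caveats as the sketch above).
from statsmodels.stats.power import TTestPower

analysis = TTestPower()

one_tailed = analysis.solve_power(effect_size=0.30, nobs=30, alpha=0.05, alternative='larger')
doubled_alpha = analysis.solve_power(effect_size=0.30, nobs=30, alpha=0.10, alternative='two-sided')

# These should be nearly identical: a one-tailed test at alpha = .05 behaves (in the expected
# direction) almost exactly like a two-tailed test at alpha = .10.
print(round(one_tailed, 3), round(doubled_alpha, 3))
```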
I hope that you’ve seen that power increases in the following scenarios:
- When the effect size is larger
- When the sample size is larger
- When the alpha level is bigger
- And even when we switch to a one-tailed test
However, there are challenges with manipulating these:
- You should almost always use a two-tailed test
- Some effects just aren’t large
- We don’t want to increase our alpha, which is the Type I error rate
The only thing we truly have the ability to do regularly is to increase our sample size, or at least get one large enough to find an effect if it exists. We can also do our best to study the largest effects of interest, and make sure our research design is clever and sensitive enough to detect them.
Before you begin any experiment, you should be thinking about power. Do you have enough power to detect the effect you’re interested in? (Is your sample size large enough?) Generally speaking, a small sample will only be able to detect a large effect. A very large sample can detect really small effects, even ones that aren’t meaningful. We’ll come back to this in class.
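One practical way to do that thinking ahead of time is an a priori power analysis: solve for the sample size you’d need, rather than for power after the fact. Here’s a hedged sketch with the same statsmodels tool as above (the d = .50 and 80% power values are just illustrative conventions, not rules):

```python
# A sketch of a priori sample-size planning (illustrative numbers; same caveats as above).
from math import ceil
from statsmodels.stats.power import TTestPower

analysis = TTestPower()

# How many participants would we need for an 80% chance (power = .80)
# of detecting a medium effect (d = .50) with alpha = .05, two-tailed?
n_needed = analysis.solve_power(effect_size=0.50, power=0.80, alpha=0.05, alternative='two-sided')
print(ceil(n_needed))  # round up -- you can't recruit a fraction of a person
```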
End by reading Matthew Crump’s considerations on power. Thinking through these scenarios will help you understand the relationship of the effect size, sample size, and error.