Critical values of t (two-tailed test)

| df | α = .10 | α = .05 | α = .01 |
|---|---|---|---|
| 1 | 6.31 | 12.71 | 63.66 | 
| 2 | 2.92 | 4.30 | 9.92 | 
| 3 | 2.35 | 3.18 | 5.84 | 
| 4 | 2.13 | 2.78 | 4.60 | 
| 5 | 2.02 | 2.57 | 4.03 | 
| 6 | 1.94 | 2.45 | 3.71 | 
| 7 | 1.89 | 2.36 | 3.50 | 
| 8 | 1.86 | 2.31 | 3.36 | 
| 9 | 1.83 | 2.26 | 3.25 | 
| 10 | 1.81 | 2.23 | 3.17 | 
| 11 | 1.80 | 2.20 | 3.11 | 
| 12 | 1.78 | 2.18 | 3.05 | 
| 13 | 1.77 | 2.16 | 3.01 | 
| 14 | 1.76 | 2.14 | 2.98 | 
| 15 | 1.75 | 2.13 | 2.95 | 
| 16 | 1.75 | 2.12 | 2.92 | 
| 17 | 1.74 | 2.11 | 2.90 | 
| 18 | 1.73 | 2.10 | 2.88 | 
| 19 | 1.73 | 2.09 | 2.86 | 
| 20 | 1.72 | 2.09 | 2.85 | 
| 21 | 1.72 | 2.08 | 2.83 | 
| 22 | 1.72 | 2.07 | 2.82 | 
| 23 | 1.71 | 2.07 | 2.81 | 
| 24 | 1.71 | 2.06 | 2.80 | 
| 25 | 1.71 | 2.06 | 2.79 | 
| 26 | 1.71 | 2.06 | 2.78 | 
| 27 | 1.70 | 2.05 | 2.77 | 
| 28 | 1.70 | 2.05 | 2.76 | 
| 29 | 1.70 | 2.05 | 2.76 | 
| 30 | 1.70 | 2.04 | 2.75 | 
| 35 | 1.69 | 2.03 | 2.72 | 
| 40 | 1.68 | 2.02 | 2.70 | 
| 45 | 1.68 | 2.01 | 2.69 | 
| 50 | 1.68 | 2.01 | 2.68 | 
| 55 | 1.67 | 2.00 | 2.67 | 
| 60 | 1.67 | 2.00 | 2.66 | 
| 65 | 1.67 | 2.00 | 2.65 | 
| 70 | 1.67 | 1.99 | 2.65 | 
| 75 | 1.67 | 1.99 | 2.64 | 
| 80 | 1.66 | 1.99 | 2.64 | 
| 85 | 1.66 | 1.99 | 2.63 | 
| 90 | 1.66 | 1.99 | 2.63 | 
| 95 | 1.66 | 1.99 | 2.63 | 
| 100 | 1.66 | 1.98 | 2.63 | 
| Infinity | 1.64 | 1.96 | 2.58 | 
First, complete the paper lab exam. Then do today’s lab! (Did you finish the lab on t-tests with dependent samples?)
Objectives
The objectives of today’s lab are to:
- Think through decision errors and how \(\alpha\) and \(\beta\) might be minimized
- Understand how you can make decisions that reduce error
- Learn about how error interacts with concepts we’re discussing in class like power, sample size, and effect sizes
There is again no answer sheet due. Don’t worry—we’ll have answers due for labs again starting next week.
Defining error
Your textbook describes a decision error as one where “the right procedures lead to the wrong decisions” (Aron, Coups, & Aron, 2013, p. 177). I like this definition, which does a nice job of summing up what we mean when we talk about errors. The authors also remind us that researchers cannot usually tell when they have made an error. If they could, they wouldn’t make the error.
As we discussed in class, Type I error, indicated by the lowercase Greek letter alpha (\(\alpha\)), represents the likelihood of a false positive. It is equivalent to the cut-off (significance) level we set for p, and we want to keep it as small as possible. One way this is often represented is in a t-table, where (as you can see above) the columns technically refer to the alpha level you have chosen. We normally choose to ask “is \(p<.05\)?” which is the same as choosing to look at the \(\alpha=.05\) column in the t-table.
We can also describe the probability of correctly retaining the null (if it is true) as \(1-\alpha\). So when we use \(\alpha=.05\) as our alpha level, we’re also giving ourselves a 95% chance of correctly retaining the null if it’s true.
Type II error (aka false negatives) represents the likelihood of failing to reject the null hypothesis when it is in fact false. We use the Greek lowercase letter beta (\(\beta\)) to represent Type II error.
One of our goals in doing statistical tests in psychology is to minimize the false positive rate (Type I error, \(\alpha\)) by setting an appropriate cut-off (e.g., testing to see if \(p<.05\) [\(\alpha=.05\)] where appropriate). We can reduce our \(\alpha\) by just using a smaller one. If you scroll up to the t-table, you’ll see what happens—our critical t-value gets larger. So at \(df=50\), for example, \(t_{crit}(50)=\pm2.01\) for \(\alpha=.05\) but \(t_{crit}(50)=\pm2.68\) for \(\alpha=.01\). That’s a higher hurdle to surpass—we are literally saying that we will only conclude that we have a statistically significant test if our results have a 1% or less chance of happening under the null. So we need a higher t-value to be able to make that conclusion. See the image below to remind yourself—the darker red shading is for the \(\frac{1}{2}\)% at either tail (-2.68 or +2.68), whereas the lighter red is for 2.5% in either tail (-2.01 and +2.01). (Reminder: we’re using \(\frac{1}{2}\)% in each tail because that’s a total of 1%, or 2.5% in each tail because that’s a total of 5%.)
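(A side note, not needed for the lab: if you’d like to verify the t-table’s critical values yourself, here’s a minimal sketch in Python using scipy. It assumes, as described above, that we’re doing a two-tailed test, so half of \(\alpha\) goes in each tail.)

```python
# A quick check of the t-table's two-tailed critical values (optional; not part of the lab).
from scipy import stats

df = 50
for alpha in (0.10, 0.05, 0.01):
    # For a two-tailed test, put half of alpha in each tail.
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    print(f"df = {df}, alpha = {alpha}: t_crit = ±{t_crit:.2f}")

# Expected output (matching the table above):
# df = 50, alpha = 0.1: t_crit = ±1.68
# df = 50, alpha = 0.05: t_crit = ±2.01
# df = 50, alpha = 0.01: t_crit = ±2.68
```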

When we set \(\alpha\) to .05, or to .01, we are trying to make sure that only 5% or 1% of “true null” hypotheses will be incorrectly rejected. We are trying to reduce our Type I error. But the error of failing to reject the null when we should in fact reject it (it’s really the case that there is an effect) is also a problem!
Thus, another goal of our hypothesis-testing procedure in psychology is to minimize the false negative rate (Type II error, or \(\beta\)) by maximizing the power of the test, represented by \(1-\beta\). That is, if there is an effect, we want to find that effect! But how likely we are to find a true effect depends on how big that effect is. What is power? We’ll discuss it more in class, but it can be defined as the probability that we will correctly reject the null hypothesis.
So while we talk about minimizing \(\alpha\), we talk about maximizing power (\(1-\beta\)). This does mean that we’re also trying to minimize \(\beta\); it’s just the other side of it. Navarro and Foxcroft have a description of why getting an exact number for \(\beta\) is difficult. (In short: \(\beta\) depends on how big the true effect actually is, and there are many possibilities for that.)
As they go on to explain, when there’s a very clear difference between what exists in the world and what the null hypothesis predicts, you will have very high power; but if the true state of the world differs from the null only slightly, then you’ll have very low power to detect that difference. (Your chance of a Type II error will be high!) We use the idea of effect size to try to quantify this distance.
Effect size
The effect size is a standardized measure of the difference between populations (usually the difference between their means). For example, an effect size might capture how much scores on a measure of depression symptoms change after an intervention. It’s not the same as a t-test, but as you’ll see below, it’s calculated in a similar way.
We only really care about the effect size when we find statistically significant results. If the results are not significant, then in principle there is no true difference between the population means, so we do not report an effect size. Here’s a table summarizing this, which we’ll also discuss in class:
| | Big effect size | Small effect size |
|---|---|---|
| Significant test results | Difference is real—and matters | Difference is real—but may not be that interesting | 
| Non-significant test results | No effect is observed | No effect is observed | 
That is to say: if your test results don’t show statistical significance, the effect size doesn’t matter because no effect was observed. Don’t report it or try to interpret it.
One common measure of effect size is called Cohen’s d; it’s named after Jacob Cohen, whom we’ve mentioned in class before. Cohen’s d is calculated by dividing the difference between the means by an estimate of the standard deviation. So it’s quite like a t-score, but with the standard deviation (rather than the standard error) in the denominator:
\[d=\frac{\textrm{mean 1}-\textrm{mean 2}}{SD}\]
Cohen’s d can be roughly interpreted as follows:
- Cohen’s \(d=0.20\): “small” effect
- Cohen’s \(d=0.50\): “medium” effect
- Cohen’s \(d=0.80\): “large” effect
These are rough estimates. In some situations, an effect of Cohen’s \(d=.30\) can be considered about as large as you would expect.
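If it helps to see the formula in action, here’s a minimal sketch in Python with made-up numbers (in the lab itself, Jamovi computes d for you). It uses the pooled standard deviation of two groups as the standardizer, which is one common choice; the paired-samples version you’ll see below uses the standard deviation of the difference scores instead.

```python
# A rough sketch of computing Cohen's d "by hand" (made-up numbers, for illustration only).
from statistics import mean, stdev
from math import sqrt

group1 = [14, 12, 15, 11, 13, 16, 12, 14]  # hypothetical scores, group 1
group2 = [10, 9, 12, 8, 11, 13, 9, 10]     # hypothetical scores, group 2

# Numerator of d: the difference between the means
mean_diff = mean(group1) - mean(group2)

# Denominator of d: an estimate of the standard deviation (here, pooled across the two groups)
n1, n2 = len(group1), len(group2)
pooled_sd = sqrt(((n1 - 1) * stdev(group1) ** 2 + (n2 - 1) * stdev(group2) ** 2) / (n1 + n2 - 2))

d = mean_diff / pooled_sd
print(round(d, 2))  # about 1.86 with these made-up numbers -- a "large" effect
```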
We can get this measure of effect size very easily. Try it out by opening, in Jamovi, the fisher dataset we worked with in the lab on dependent means. Re-run the paired samples t-test to compare the HRSD pre and post scores. Under Additional Statistics, check the box for Effect size, and also check Confidence interval to get one around the effect size. Check Descriptives as well (right below that).
What should I see?
First, you will see the results of the t-test we saw in last week’s lab. Because the results are statistically significant (\(p<.05\), which is our alpha-level for this test), we can then interpret the effect size.
Paired Samples T-Test

| | | | statistic | df | p |
|---|---|---|---|---|---|
| hrsd.pre | hrsd.post | Student's t | 12.6 | 31.0 | < .001 |
To the right of the test results, you will also see the Cohen’s d effect size measure, which I include here. The SD in the denominator of the equation here is the standard deviation of the difference scores.
Effect Size

| | Estimate | 95% CI Lower | 95% CI Upper |
|---|---|---|---|
| Cohen's d | 2.23 | 1.57 | 2.88 |
This is a very large effect size! The 95% confidence interval, here, is the range of scores likely to include the true effect size. One way to think about this: if this experiment was conducted 100 times, and each time we computed the 95% confidence interval around the effect size, 95% of them would contain the true effect size.
If the confidence interval around Cohen’s d includes 0, that suggests you likely don’t have a meaningful effect. Here, though, even the lower bound of our confidence interval is a large effect.
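(A side note for checking the arithmetic, not something you need to compute: for a paired t-test, Cohen’s d based on the difference scores is just the t statistic divided by the square root of the number of pairs. With \(df=31\) there were \(n=32\) pairs, so:)

\[d=\frac{\textrm{mean of differences}}{SD_{\textrm{differences}}}=\frac{t}{\sqrt{n}}=\frac{12.6}{\sqrt{32}}\approx 2.23\]

(The small difference you might get by hand comes from rounding in the reported t.)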
Reporting Cohen’s d
How do you report this? Just tack it on after the \(p<.05\):
Depression scores dropped from baseline (\(M=13.8\)) to post-treatment (\(M=5.78\)) on the HRSD, \(t(31)=12.6,p<.05\), Cohen’s \(d=2.23\), 95% CI [1.57, 2.88].
Those brackets are how 95% confidence intervals tend to be reported, too.
You might want to report that \(p<.001\), since that’s what Jamovi told you. In some circumstances, that’s appropriate. I recommend defaulting, however, to reporting the answer to the question you were asking.
Power
Remember that power is the probability that we will correctly reject the null hypothesis.
There are three factors that most affect statistical power:
- Effect size: It’s easier to find large effects than small effects
- Sample size: The larger the sample, the greater the power
- Your alpha (Type I error rate): Decreasing your Type I error rate by using a smaller \(\alpha\) will also decrease power
Before you continue, make sure you have working definitions of the following terms. You might want to chat with classmates about these.
- Type I error
- Type II error
- Effect size
- Power
- \(\alpha\)
- \(\beta\)
Go to this website, which is an interactive visualization that will let you manipulate power, alpha, sample size (n), and effect size (Cohen’s d). Think/talk through what the two different distributions on the website represent. What do the shaded colors (red, light blue and dark blue) indicate? Play around with some settings, and see what happens. (Don’t ignore the “reset zoom” button!)
Solve for power
Use the website to solve for power in the following situations. Copy each table to your notes and fill in the power column.
Effect size
| | Effect size (d) | Sample size | \(\alpha\) | Power (you fill in) |
|---|---|---|---|---|
| Case 1 | .20 | 30 | .05 | |
| Case 2 | .50 | 30 | .05 | |
| Case 3 | .80 | 30 | .05 | |
How does effect size affect power? Why is this the case?
Sample size
| | Effect size (d) | Sample size | \(\alpha\) | Power (you fill in) |
|---|---|---|---|---|
| Case 1 | .30 | 10 | .05 | |
| Case 2 | .30 | 20 | .05 | |
| Case 3 | .30 | 40 | .05 | |
How does sample size affect power? Why?
Alpha
| | Effect size (d) | Sample size | \(\alpha\) | Power (you fill in) |
|---|---|---|---|---|
| Case 1 | .30 | 30 | .01 | |
| Case 2 | .30 | 30 | .05 | |
| Case 3 | .30 | 30 | .10 | |
How does alpha level affect power? Why?
What sort of conclusions did you reach?
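If you’d like to double-check the numbers you read off the website, here’s a hedged sketch using Python’s statsmodels package. It assumes a one-sample/paired t-test, which may not be exactly the test the website models, so expect the values to be close but not necessarily identical.

```python
# A sketch for checking your "solve for power" answers (assumes a one-sample/paired t-test;
# the website may model things a little differently, so numbers may not match exactly).
from statsmodels.stats.power import TTestPower

analysis = TTestPower()

# The "Effect size" table: d varies, sample size = 30, alpha = .05, two-tailed
for d in (0.20, 0.50, 0.80):
    power = analysis.solve_power(effect_size=d, nobs=30, alpha=0.05, alternative='two-sided')
    print(f"d = {d}: power = {power:.2f}")

# For the other two tables, vary nobs= (sample size) or alpha= in the same way.
```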
Number of tails
What happens if you play around with one- vs. two-tailed tests? Note that we still almost always want to do a two-tailed test! All we’re really doing by switching from a two-tailed to a one-tailed test here is putting our whole alpha into a single tail, which effectively doubles the alpha level in the tail we care about.
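If you want to convince yourself of that claim, here’s a tiny sketch (same statsmodels caveats as above):

```python
# The "one-tailed test ~ doubled alpha" idea, in code (same caveats as the sketch above).
from statsmodels.stats.power import TTestPower

analysis = TTestPower()

one_tailed = analysis.solve_power(effect_size=0.30, nobs=30, alpha=0.05, alternative='larger')
doubled_alpha = analysis.solve_power(effect_size=0.30, nobs=30, alpha=0.10, alternative='two-sided')

# These should be nearly identical: a one-tailed test at alpha = .05 behaves (in the expected
# direction) almost exactly like a two-tailed test at alpha = .10.
print(round(one_tailed, 3), round(doubled_alpha, 3))
```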
I hope that you’ve seen that power increases in the following scenarios:
- When the effect size is larger
- When the sample size is larger
- When the alpha level is bigger
- And even when we switch to a one-tailed test
However, there are challenges with manipulating these:
- You should almost always use a two-tailed test
- Some effects just aren’t large
- We don’t want to increase our alpha, which is the Type I error rate
The only thing we truly have the ability to do regularly is to increase our sample size, or at least get one large enough to find an effect if it exists. We can also do our best to study the largest effects of interest, and make sure our research design is clever and sensitive enough to detect them.
Before you begin any experiment, you should be thinking about power. Do you have enough power to detect the effect you’re interested in? (Is your sample size large enough?) Generally speaking, a small sample will only be able to detect a large effect. A very large sample can detect really small effects, even ones that aren’t meaningful. We’ll come back to this in class.
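One practical way to do that thinking ahead of time is an a priori power analysis: solve for the sample size you’d need, rather than for power after the fact. Here’s a hedged sketch with the same statsmodels tool as above (the d = .50 and 80% power values are just illustrative conventions, not rules):

```python
# A sketch of a priori sample-size planning (illustrative numbers; same caveats as above).
from math import ceil
from statsmodels.stats.power import TTestPower

analysis = TTestPower()

# How many participants would we need for an 80% chance (power = .80)
# of detecting a medium effect (d = .50) with alpha = .05, two-tailed?
n_needed = analysis.solve_power(effect_size=0.50, power=0.80, alpha=0.05, alternative='two-sided')
print(ceil(n_needed))  # round up -- you can't recruit a fraction of a person
```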
End by reading Matthew Crump’s considerations on power. Thinking through these scenarios will help you understand the relationship of the effect size, sample size, and error.