ST1131 Introduction to Statistics and Statistical Computing

Concepts I Often Forget or Get Mixed Up

Note: Taken in AY2024/25 Semester 1

Central Tendency and Variability

Mean, standard deviation, and variance can be sensitive to outliers, while the median and interquartile range are more robust and less affected by them.
When you transform data linearly from $X$ to $Y = aX + b$, the mean changes from $\overline{X}$ to $a\overline{X} + b$, and the variance changes from $S^2$ to $a^2 S^2$.
For two random variables $X$ and $Y$, there are few interesting properties to note:
- Linearity of Expectation: $E(X\pm Y)=E(X)\pm E(Y)$.
- Linearity of Variance: $\text{Var}(X\pm Y)=\text{Var}(X)+\text{Var}(Y)$.

Conditional Probability

Conditional probability is the probability of an event $A$ occurring given that another event $B$ has already occurred. It’s expressed as:

\[P(A | B) = \frac{P(A \cap B)}{P(B)}\]

Law of Total Probability

The Law of Total Probability helps us find the probability of an event $A$ by considering all possible ways it could occur, broken down by other events ($B_1, B_2, …, B_n$):

\[P(A) = P(A \cap B_1) + P(A \cap B_2) + ... + P(A \cap B_n)\]

Bayes’ Theorem

Bayes’ Theorem is a formula that helps us update the probability of an event $B_i$ based on new evidence ($A$). It’s written as:

\[P(B_i | A) = \frac{P(A | B_i) P(B_i)}{P(A | B_1) P(B_1) + P(A | B_2) P(B_2) + ... + P(A | B_n) P(B_n)}\]

Some Notes on Probability Concepts

Events $A$ and $B$ are said to be independent if and only if $P(A\cap B)=P(A)P(B)$.
Events $A$ and $B$ are said to be (mutually) disjoint events if and only if $P(A\cap B)=0$, which means the events cannot happen together.

Binomial Distribution

The number of ways to choose $k$ successes out of $n$ trials is given by the combination formula:

\[\binom{n}{k} = \frac{n!}{k!(n-k)!}\]

3 Conditions for Binomial Distribution:

There are $n$ trials, each with two possible outcomes (success or failure).
Each trial has a probability $p$ of success.
All trials are independent of each other.

Binomial Random Variable

The number of successes in $n$ trials is modeled by a Binomial Distribution: $Bin(n, p)$.
A Bernoulli Distribution is a special case of the binomial distribution when $n = 1$: $Bin(1, p)$. The sum of independent Bernoulli random variables follows a Binomial Distribution.

Binomial Formula (for $X \sim \text{Bin}(n, p)$):

The probability of exactly $x$ successes is given by: $P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$
The Expected Value (mean) of $X$ is: $E(X) = np$.
The Variance of $X$ is: $\text{Var}(X) = np(1 - p)$.

Point Estimate

$\overline{X} \to \mu$ and $\hat{p} \to p$ represent point estimates.
Point estimates don’t show how close they are to the true value (population parameters).

Standard Error and Margin of Error

These terms are quite similar, but they differ slightly in how they’re used:

Standard Error (SE) is the standard deviation of the measured sample mean. A lower SE indicates more accurate results.
Margin of Error (MOE) represents the range within which we expect the true population parameter to fall, given a certain confidence level. A smaller MOE suggests more precise estimates.

Confidence Intervals (CI)

In the long run, 95% of intervals will contain the true population parameter. CI = Point Estimate $\pm$ Margin of Error. The width of confidence interval ($D$) $=2\times$ MOE.

Confidence Interval for Proportion

To find the CI given a confidence level ($x$):

Calculate $\hat{p}$ and check if $n\hat{p}(1 - \hat{p}) \geq 5$.
Let $\alpha = 1 - x$.
CI $=\hat{p} \pm Z_{1-\frac{\alpha}{2}} \times \sqrt{\displaystyle\frac{\hat{p}(1 - \hat{p})}{n}}$.

Determine Sample Size ($n$) Before Study:

Decide on the confidence level ($x$) and the width of the CI ($D$).
Use the formula: $n \geq \left(\displaystyle\frac{2Z_{1-\frac{\alpha}{2}}}{D}\right)^2 \times \hat{p}(1 - \hat{p})$, where $\hat{p} = 0.5$.

Confidence Interval for Mean

t-distribution ($t_{\text{df}}$) approaches $N(0, 1)$ as degrees of freedom ($df$) increase. To find the CI given a confidence level ($x$):

Assumptions: The sample is random (not biased), and the data distribution is symmetric (or the sample size is large enough).
CI $=\overline{X} \pm t_{n-1; 1-\frac{\alpha}{2}} \times \displaystyle\frac{s}{\sqrt{n}}$.

Determine Sample Size ($n$) Before Study:

Decide on the confidence level ($x$) and the width of the CI ($D$).
Use the formula: $n \geq \left(\displaystyle\frac{2Z_{1-\frac{\alpha}{2}} \times s}{D}\right)^2$.
For $s$, look for similar studies (as given in the context). Ensure $n \geq 30$ to apply the t-distribution comfortably.

Hypothesis Testing

Hypothesis testing is all about making decisions based on data. The main idea is to test a null hypothesis ($H_0$) against an alternative hypothesis ($H_1$). Here’s a breakdown of the key terms:

Null hypothesis ($H_0$) vs. Alternative hypothesis ($H_1$)
Test statistic: How far the point estimate is from our initial guess (the null hypothesis).
Null distribution: The distribution of the test statistic under $H_0$.
p-Value: This tells you how unlikely your observed result is if $H_0$ is true.
Significance level ($\alpha$): If the p-Value is less than or equal to α, you reject $H_0$.
A test is statistically significant when we reject $H_0$.
Type I Error: Rejecting $H_0$ when it’s actually true.
Type II Error: Not rejecting $H_0$ when it is false.
Increasing your sample size can help reduce both types of errors.

One Sample, Proportion

Here’s how you would test a proportion in one sample:

Assumptions: The data is categorical, random, and we have $np_0(1 - p_0) \geq 5$.
Hypothesis:
- $H_0: p = p_0$
- $H_1: p \neq p_0$
Test statistic: $z = \displaystyle\frac{p - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$ where $z \sim N(0,1)$.
p-Value:
- For a right-sided test: $P(z \geq \text{Test stat.} \| z \sim N(0,1))$
- For a two-sided test: $2 \times P(z \geq \text{Test stat.} \| z \sim N(0,1))$
Conclusion:
If the p-Value $\leq \alpha$, reject $H_0$. Otherwise, you cannot reject $H_0$.

One Sample, Mean

Testing the mean of one sample:

Assumptions: The data is quantitative, random, and normally distributed (or $n \geq 30$).
Hypothesis:
- $H_0: \mu = \mu_0$
- $H_1: \mu \neq \mu_0$
Test statistic: $T = \displaystyle\frac{\overline{X} - \mu_0}{s / \sqrt{n}}$ where $T \sim t_{n-1}(0,1)$.
p-Value and Conclusion: This works the same as the proportion test.
- The result of a two-sided test for the mean is equivalent to using a confidence interval.

Two Sample, Independent, Equal Variance

When comparing the means of two independent samples with equal variances:

Assumptions: The data is quantitative, random, independent, and the population distribution is approximately normal (or $n$ is large enough). For the equal variance test, $p > 0.05$.
Hypothesis:
- $H_0: \mu_1 = \mu_2$
- $H_1: \mu_1 \neq \mu_2$
Test statistic: $T = \displaystyle\frac{\overline{X} - \overline{Y}}{se}$ where $se = \displaystyle\sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}$ and $s_p^2 = \displaystyle\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$ (Pooled estimate of common variance).
Where $T \sim t_{n_1 + n_2 - 2}$.

Two Sample, Independent, Unequal Variance

This is similar to the previous test, but with unequal variances:

Assumptions: Same as above, but the population variances are different.
Hypothesis:
- $H_0: \mu_1 = \mu_2$
- $H_1: \mu_1 \neq \mu_2$
Test statistic: $T = \displaystyle\frac{\overline{X} - \overline{Y}}{se}$ where $se = \displaystyle\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$ and $T \sim t_{\text{df}}$.

Two Sample, Dependent

When the two samples are dependent (e.g., before and after comparisons):

Each observation has a matching pair.
Take the difference of the paired observations and compare the mean of those differences to 0. This is similar to performing a one-sample test.

QQ Plot - Check Normality

A QQ plot helps check if your data follows a normal distribution.

Right tail below/above the line: This indicates a longer/shorter right tail.
Left tail below/above the line: This indicates a shorter/longer left tail.

Implementation of Source §4 Streams from Scratch in JavaScript

Interesting Problems from CS1231S Past Year Papers