A measure of central tendency gives us a single representative value that summarises an entire dataset. It answers: "What is the typical or central value?" The three main measures are the Mean, Median, and Mode, each with its own strengths and use cases.
The arithmetic mean is the sum of all observations divided by the number of observations. It is the most widely used measure and is the "balance point" of a distribution.
For Grouped Data: x̄ = Σ(fᵢ × mᵢ) / Σfᵢ
The median is the middle value when data is arranged in ascending order. For an even number of observations, it's the mean of the two middle values. The median is not affected by extreme values (outliers), making it ideal for skewed distributions like income and house prices.
Even n: Median = average of values at n/2 and (n/2)+1
For Grouped Data (Ogive method):
Median = L + [(n/2 − cf) / f] × h
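The grouped-data formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming equal-width classes; the function name `grouped_median` and the class data are hypothetical, not taken from the text.

```python
def grouped_median(lower_bounds, freqs, width):
    """Median = L + ((n/2 - cf) / f) * h for equal-width classes."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("total frequency must be positive")
    cum = 0  # cumulative frequency before the current class (cf)
    for lower, f in zip(lower_bounds, freqs):
        # The median class is the first one whose cumulative frequency reaches n/2.
        if cum + f >= n / 2:
            return lower + ((n / 2 - cum) / f) * width
        cum += f

# Hypothetical classes 0-10, 10-20, 20-30, 30-40
lowers = [0, 10, 20, 30]
freqs = [5, 8, 12, 5]
median = grouped_median(lowers, freqs, 10)
print(round(median, 2))
```

Here n = 30, so the median class is 20-30 (cf = 13, f = 12), and the interpolation lands a little above 21.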
The mode is the value that appears most often in a dataset. A distribution can be unimodal (one mode), bimodal (two modes), or multimodal. Mode is the only measure applicable to categorical/nominal data (e.g., most popular colour, most common occupation).
L = lower boundary of modal class | f₁ = modal class freq | f₀ = preceding class freq | f₂ = succeeding class freq | h = class width
The geometric mean is the nth root of the product of n values. It is used when values are multiplicative in nature, such as compound interest, population growth rates, and investment returns. For positive values it is always ≤ the Arithmetic Mean (AM-GM inequality).
Equivalent using logarithms:
log(GM) = [log(x₁) + log(x₂) + ... + log(xₙ)] / n
Using values: 1.20, 0.90, 1.15 → GM = ∛(1.20 × 0.90 × 1.15) = ∛1.242 ≈ 1.075 → 7.5% CAGR.
Arithmetic mean gives (20 − 10 + 15)/3 = 8.33%, which overstates returns!
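The same comparison can be checked with Python's standard `statistics` module. This is a small sketch using the three growth factors from the example; the variable names are illustrative.

```python
import math
from statistics import geometric_mean, mean

growth_factors = [1.20, 0.90, 1.15]  # +20%, -10%, +15%

gm = geometric_mean(growth_factors)  # cube root of the product
am = mean(growth_factors)            # simple average of the factors

# Equivalent route through logarithms: log(GM) is the mean of the logs.
gm_via_logs = math.exp(sum(math.log(x) for x in growth_factors) / len(growth_factors))

print(f"GM = {gm:.4f}, AM = {am:.4f}")
```

Compounding the geometric mean three times reproduces the actual overall growth of 1.242, which the arithmetic mean does not.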
The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals. It is used when averaging rates, such as speed, frequency, and P/E ratios in finance. It gives more weight to smaller values.
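A classic illustration is average speed over a round trip. The sketch below assumes a hypothetical journey covering the same distance at 60 km/h and then at 40 km/h.

```python
from statistics import harmonic_mean, mean

# Hypothetical trip: equal distances at two different speeds.
speeds = [60, 40]

avg_speed = harmonic_mean(speeds)  # 2 / (1/60 + 1/40)
naive = mean(speeds)               # arithmetic mean of the speeds

print(avg_speed, naive)
```

The harmonic mean gives 48 km/h, the true average speed, because more time is spent travelling at the slower speed; the arithmetic mean of 50 km/h overstates it.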
When different values carry different levels of importance (weights), we use the weighted mean. It is the foundation of index numbers, GPA calculations, and portfolio return calculations.
Example: Marks (values) and Credits (weights) for 4 subjects
| Subject | Marks (xᵢ) | Credits (wᵢ) | wᵢ × xᵢ |
|---|---|---|---|
| Economics | 85 | 4 | 340 |
| Statistics | 72 | 3 | 216 |
| Finance | 90 | 4 | 360 |
| History | 65 | 2 | 130 |
| Total | - | 13 | 1046 |
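The table's totals lead directly to the weighted mean, Σ(wᵢ × xᵢ) / Σwᵢ. A short sketch using the same marks and credits (the dictionary names are illustrative):

```python
marks = {"Economics": 85, "Statistics": 72, "Finance": 90, "History": 65}
credit_hours = {"Economics": 4, "Statistics": 3, "Finance": 4, "History": 2}

weighted_sum = sum(marks[s] * credit_hours[s] for s in marks)  # sum of w_i * x_i
total_weight = sum(credit_hours.values())                      # sum of w_i
weighted_mean = weighted_sum / total_weight

simple_mean = sum(marks.values()) / len(marks)  # ignores the credits

print(round(weighted_mean, 2), simple_mean)
```

The weighted mean (about 80.46) is higher than the simple mean (78) because the two highest marks carry the most credits.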
| Situation | Best Measure | Why |
|---|---|---|
| Exam scores, heights, temperatures | Arithmetic Mean | Symmetric, no outliers |
| Income, house prices, wealth | Median | Skewed distribution, outliers |
| Shoe sizes, favourite colour, most common profession | Mode | Categorical / nominal data |
| Investment returns, population growth, CAGR | Geometric Mean | Multiplicative growth |
| Speeds, rates, P/E ratio averaging | Harmonic Mean | Rate-based data |
| GPA, portfolio returns, index numbers | Weighted Mean | Unequal importance |
Two datasets can have the same mean but completely different spreads. Student A scores: 48, 50, 50, 52 (Mean=50). Student B scores: 10, 50, 90, 50 (Mean=50). Same mean, but Student B is wildly inconsistent!
Mean Deviation (from mean) = Σ|xᵢ − x̄| / n
Mean Deviation (from median) = Σ|xᵢ − M| / n
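Applying the mean deviation formulas to the two students makes the difference in spread concrete. A minimal sketch; the helper name `mean_deviation` is illustrative.

```python
from statistics import mean, median

def mean_deviation(values, centre):
    """Average absolute distance of the values from a chosen centre."""
    return sum(abs(x - centre) for x in values) / len(values)

student_a = [48, 50, 50, 52]
student_b = [10, 50, 90, 50]

md_a = mean_deviation(student_a, mean(student_a))
md_b = mean_deviation(student_b, mean(student_b))
md_b_median = mean_deviation(student_b, median(student_b))

print(md_a, md_b, md_b_median)
```

Student A's scores sit on average 1 mark from the mean, while Student B's sit 20 marks away.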
Standard deviation is the most important measure of dispersion in statistics. It quantifies the typical spread of data around the mean. The population and sample formulas differ in the denominator (N vs n−1).
Sample Variance: s² = Σ(xᵢ − x̄)² / (n−1) ← Bessel's correction
Population Standard Deviation: σ = √[Σ(xᵢ − μ)² / N]
Shortcut formula: σ² = Σxᵢ²/N − (Σxᵢ/N)² = E(X²) − [E(X)]²
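The population and sample versions, and the shortcut formula, can be compared with the standard library. The sketch below reuses Student B's scores from the dispersion example.

```python
from statistics import pstdev, pvariance, stdev

scores = [10, 50, 90, 50]

population_sd = pstdev(scores)  # divides by N
sample_sd = stdev(scores)       # divides by n - 1 (Bessel's correction)

# Shortcut formula: sigma^2 = sum(x^2)/N - (sum(x)/N)^2
n = len(scores)
shortcut_var = sum(x * x for x in scores) / n - (sum(scores) / n) ** 2

print(round(population_sd, 3), round(sample_sd, 3), shortcut_var)
```

The sample figure is always the larger of the two, since it divides the same sum of squares by a smaller number.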
The Interquartile Range (IQR) is the range of the middle 50% of data. The quartiles are Q1 (25th percentile), Q2 (the median), and Q3 (75th percentile). A box plot visualises the five-number summary: Min, Q1, Median, Q3, Max.
Q3 = value at 3(n+1)/4 position
IQR = Q3 − Q1
Outlier bounds: < Q1 − 1.5×IQR or > Q3 + 1.5×IQR
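The quartile positions and outlier bounds can be sketched with `statistics.quantiles`. Its `"exclusive"` method uses (n+1)-based positions, which matches the position rule above; the dataset is hypothetical.

```python
from statistics import quantiles

data = sorted([2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 40])  # hypothetical, n = 11

# With n = 11 the quartiles fall exactly on positions 3, 6 and 9.
q1, q2, q3 = quantiles(data, n=4, method="exclusive")

iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]

print(q1, q2, q3, iqr, outliers)
```

Other software may use slightly different quartile conventions, so results can differ a little for small samples.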
The coefficient of variation (CV) allows comparison of spread across datasets with different units or scales. It expresses the standard deviation as a percentage of the mean. Lower CV = more consistent.
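As a quick sketch, CV = (standard deviation / mean) × 100 can be computed for two score lists with the same mean. The function name is illustrative, and the population standard deviation is an assumption made here for simplicity.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """CV = (standard deviation / mean) * 100, using the population SD."""
    return pstdev(values) / mean(values) * 100

cv_a = coefficient_of_variation([48, 50, 50, 52])  # consistent scores
cv_b = coefficient_of_variation([10, 50, 90, 50])  # erratic scores

print(f"CV A = {cv_a:.2f}%, CV B = {cv_b:.2f}%")
```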
The normal distribution is the most important distribution in statistics. Many natural phenomena (heights, exam scores, measurement errors) approximately follow it. It is symmetric, bell-shaped, and completely defined by its mean (μ) and standard deviation (σ).
Empirical Rule (68-95-99.7 Rule):
μ ± 1σ covers 68.27% of data
μ ± 2σ covers 95.45% of data
μ ± 3σ covers 99.73% of data
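These percentages follow from the normal distribution itself: the probability of falling within k standard deviations of the mean is erf(k/√2). A minimal check using only the standard library:

```python
import math

def coverage_within(k):
    """P(mu - k*sigma < X < mu + k*sigma) for a normal X, via the error function."""
    return math.erf(k / math.sqrt(2))

coverage = {k: round(100 * coverage_within(k), 2) for k in (1, 2, 3)}
print(coverage)
```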
Skewness measures asymmetry. Kurtosis measures the "tailedness": how heavy the tails are compared to a normal distribution.
Positive Skew (right): Mean > Median > Mode → long right tail
Negative Skew (left): Mean < Median < Mode → long left tail
Symmetric: Mean = Median = Mode
Models the number of successes in n independent Bernoulli trials, where each trial has probability p of success. Used for quality control, election polling, medical trials.
Mean = np | Variance = np(1−p) | SD = √[np(1−p)]
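The mean and variance formulas can be verified numerically from the binomial probabilities P(X = k) = C(n, k) pᵏ (1−p)ⁿ⁻ᵏ. The values n = 10 and p = 0.3 are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3  # hypothetical: 10 trials, 30% success probability
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(round(mean, 4), round(variance, 4))  # compare with np and np(1-p)
```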
Models the number of events occurring in a fixed interval of time or space, when events occur at a constant average rate (λ). Used for: calls per hour, accidents per day, typos per page.
Mean = λ | Variance = λ | (Mean = Variance is a key property!)
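The Mean = Variance property can be checked from the Poisson probabilities P(X = k) = e^(−λ) λᵏ / k!. The sketch assumes a hypothetical rate of 4 calls per hour and truncates the infinite sum where the tail is negligible.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) = e^(-lam) * lam^k / k!"""
    return exp(-lam) * lam**k / factorial(k)

lam = 4.0       # hypothetical average of 4 calls per hour
ks = range(60)  # truncation point; the tail beyond 60 is negligible for lam = 4

mean = sum(k * poisson_pmf(k, lam) for k in ks)
variance = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)

print(round(mean, 6), round(variance, 6))
```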
Probability is a number between 0 and 1 that measures how likely an event is to occur. P=0 means impossible, P=1 means certain.
P(A') = 1 − P(A) (Complement Rule)
0 ≤ P(A) ≤ 1 (Axiom of Probability)
Bayes' theorem is one of the most powerful ideas in all of statistics. It tells us how to update our prior beliefs when we receive new evidence.
P(H|E) = Posterior (belief after evidence)
P(H) = Prior (initial belief)
P(E|H) = Likelihood (how well H explains E)
P(E) = Marginal (total probability of E)
P(D|+) = P(+|D)×P(D) / P(+) = (0.99×0.01) / (0.99×0.01 + 0.01×0.99) = 0.0099/0.0198 = 50%! Not 99% as most people intuitively assume.
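The calculation above can be wrapped in a small function. The figures read off the worked numbers are a 1% prior, a 99% true-positive rate, and a 1% false-positive rate; the function and parameter names are illustrative.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(H|E) = P(E|H) P(H) / [P(E|H) P(H) + P(E|not H) P(not H)]."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

p_disease_given_positive = posterior(prior=0.01, likelihood=0.99, false_positive_rate=0.01)
print(p_disease_given_positive)

# Raising the prior (a higher-risk group) changes the answer sharply.
p_high_risk = posterior(prior=0.10, likelihood=0.99, false_positive_rate=0.01)
print(round(p_high_risk, 3))
```

The second call shows why the prior matters: with a 10% prior the same test result gives a posterior above 90%.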
Var(X) = E(X²) − [E(X)]² = Σ xᵢ² × P(xᵢ) − μ²
SD(X) = √Var(X)
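As a worked instance, take a fair six-sided die (a hypothetical example, not from the text). Exact fractions keep the arithmetic free of rounding error.

```python
from fractions import Fraction
from math import sqrt

outcomes = range(1, 7)
prob = Fraction(1, 6)  # each face is equally likely

expected = sum(x * prob for x in outcomes)         # E(X)
expected_sq = sum(x * x * prob for x in outcomes)  # E(X^2)
variance = expected_sq - expected**2               # Var(X) = E(X^2) - [E(X)]^2
sd = sqrt(variance)

print(expected, variance, round(sd, 4))
```

This gives E(X) = 7/2 and Var(X) = 35/12.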
Pearson's r measures the strength and direction of a linear relationship between two continuous variables. It ranges from −1 to +1.
r = [nΣxy − ΣxΣy] / √{[nΣx² − (Σx)²][nΣy² − (Σy)²]}
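The computational formula translates term by term into code. A minimal sketch with hypothetical paired data; `pearson_r` is an illustrative name.

```python
from math import sqrt

def pearson_r(xs, ys):
    """r = [n*Sxy - Sx*Sy] / sqrt([n*Sxx - Sx^2][n*Syy - Sy^2])."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx**2) * (n * syy - sy**2))

hours = [1, 2, 3, 4, 5]  # hypothetical study hours
marks = [2, 4, 5, 4, 5]  # hypothetical marks
r = pearson_r(hours, marks)
print(round(r, 4))
```

For these numbers the numerator is 30 and the denominator is √1500, giving r ≈ 0.77, a fairly strong positive linear relationship.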
Spearman's ρ (rho) is a non-parametric measure based on the ranks of data. Use it when data is ordinal, or when the relationship is monotonic but not necessarily linear.
where dᵢ = rank(xᵢ) − rank(yᵢ)
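A short sketch of the rank-difference calculation, using the standard no-ties formula ρ = 1 − 6Σdᵢ² / [n(n² − 1)]. The data and function names are hypothetical, and tied values are assumed not to occur.

```python
def ranks(values):
    """Rank 1 = smallest value. Assumes no tied values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), valid when there are no ties."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n**2 - 1))

judge_a = [86, 97, 99, 100, 101]  # hypothetical scores from judge A
judge_b = [2, 20, 28, 27, 50]     # hypothetical scores from judge B
rho = spearman_rho(judge_a, judge_b)
print(rho)
```

Only one pair of adjacent ranks is swapped between the judges, so Σdᵢ² = 2 and ρ = 0.9.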
Regression finds the best-fit line through data points. The "Ordinary Least Squares" (OLS) method minimises the sum of squared residuals (vertical distances from points to the line).
b (slope) = [nΣxy − ΣxΣy] / [nΣx² − (Σx)²] = r × (σy/σx)
a (intercept) = ȳ − b×x̄
Note: Regression line always passes through (x̄, ȳ)
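The slope and intercept formulas can be sketched directly. The data are hypothetical and `ols_fit` is an illustrative name.

```python
def ols_fit(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx**2)  # slope
    a = sy / n - b * sx / n                      # intercept = y_bar - b * x_bar
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = ols_fit(xs, ys)

x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
print(a, b, a + b * x_bar)  # the last value should equal y_bar
```

Here the fitted line is y = 2.2 + 0.6x, and evaluating it at x̄ = 3 returns ȳ = 4.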
In statistics, there are two regression lines: Y on X (used to predict Y given X), and X on Y (used to predict X given Y). They are different unless r = ±1.
X on Y: (x − x̄) = r(σx/σy)(y − ȳ) → use to predict X
Both lines intersect at the point (x̄, ȳ)
Product of regression coefficients = r² → byx × bxy = r²
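The identity byx × bxy = r² is easy to confirm numerically. The sketch below uses hypothetical data and computes both regression coefficients from the same sums.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
sx, sy = sum(xs), sum(ys)

cross = n * sum(x * y for x, y in zip(xs, ys)) - sx * sy  # n*Sxy - Sx*Sy
var_x = n * sum(x * x for x in xs) - sx**2                # n*Sxx - Sx^2
var_y = n * sum(y * y for y in ys) - sy**2                # n*Syy - Sy^2

b_yx = cross / var_x  # slope of Y on X
b_xy = cross / var_y  # slope of X on Y
r_squared = cross**2 / (var_x * var_y)

print(b_yx, b_xy, b_yx * b_xy, r_squared)
```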
Hypothesis testing is a formal procedure to decide whether sample data provides enough evidence to reject a null hypothesis (H₀). We never "prove" H₀ true; we only reject it or fail to reject it.
t-test (unknown σ, small n): t = (x̄ − μ₀) / (s/√n), df = n−1
Chi-square (goodness of fit): χ² = Σ[(O−E)²/E]
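As an illustration of the t-test formula, the sketch below computes the test statistic for a hypothetical sample of eight observations against μ₀ = 50. The function name and data are assumptions for the example.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """Return (t, df) with t = (x_bar - mu0) / (s / sqrt(n)) and df = n - 1."""
    n = len(sample)
    t = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
    return t, n - 1

sample = [52, 48, 55, 51, 49, 53, 50, 54]  # hypothetical measurements
t_stat, df = one_sample_t(sample, mu0=50)
print(round(t_stat, 3), df)
```

With x̄ = 51.5 and s² = 6, the statistic is about 1.73 on 7 degrees of freedom. Standard t-tables give a two-tailed 5% critical value of about 2.365 for df = 7, so in this example we would fail to reject H₀.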
| | H₀ is TRUE | H₀ is FALSE |
|---|---|---|
| Reject H₀ | Type I Error (α): False Positive | Correct Decision (Power = 1−β) |
| Fail to Reject H₀ | Correct Decision (1−α) | Type II Error (β): False Negative |
A population is the entire group of interest. A sample is a subset drawn from it. Statistics (from sample) are used to estimate Parameters (of population).
| Measure | Population (Parameter) | Sample (Statistic) |
|---|---|---|
| Mean | μ (mu) | x̄ (x-bar) |
| Variance | σ² (sigma squared) | s² |
| Std Dev | σ (sigma) | s |
| Size | N | n |
| Proportion | P | p̂ |
The CLT states that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases (n ≥ 30 is a common rule of thumb), regardless of the shape of the population distribution, provided the population variance is finite. This is why the normal distribution is everywhere!
SE(x̄) = σ/√n (Standard Error of the Mean)
For large n, x̄ ~ N(μ, σ²/n) approximately
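A small simulation can make the CLT and the standard error formula tangible. This sketch assumes an exponential population (mean 1, standard deviation 1), chosen here only because it is strongly skewed; the seed and trial count are arbitrary.

```python
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the sketch is repeatable

def sample_means(n, trials=2000):
    """Means of `trials` samples of size n from an exponential(1) population."""
    return [mean([random.expovariate(1.0) for _ in range(n)]) for _ in range(trials)]

means_n30 = sample_means(30)

observed_se = pstdev(means_n30)   # spread of the simulated sample means
theoretical_se = 1.0 / 30 ** 0.5  # sigma / sqrt(n) with sigma = 1

print(round(mean(means_n30), 3), round(observed_se, 3), round(theoretical_se, 3))
```

Even though individual observations are far from normal, the sample means cluster around μ = 1 with a spread close to σ/√n.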
Multiplicative: Y = T × S × C × I (when seasonal variation grows with trend)
A moving average smooths out short-term fluctuations to reveal the underlying trend. A 3-year moving average replaces each value with the average of it and its two neighbours.
For even-period MAs (e.g. 4-point), a second centring average is needed.
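A centred moving average for an odd window can be sketched as follows. The yearly figures are hypothetical, and the even-period centring step is not covered here.

```python
def moving_average(values, window=3):
    """Centred moving average for an odd window; the end points get no value."""
    half = window // 2
    return [
        sum(values[i - half : i + half + 1]) / window
        for i in range(half, len(values) - half)
    ]

sales = [12, 15, 14, 18, 20, 19, 23]  # hypothetical yearly figures
trend = moving_average(sales, window=3)
print([round(t, 2) for t in trend])
```

A 3-point window loses one value at each end, so seven observations yield five trend values.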
Index numbers are specialised averages that measure relative change in a variable (or group of variables) over time or between places. They reduce complex data to a single comparable number. The Consumer Price Index (CPI) measures inflation; the Sensex measures stock market performance.
where P₀ = price in base year, P₁ = price in current year
Paasche: Pa = [Σ(P₁Q₁) / Σ(P₀Q₁)] × 100
Fisher: F = √(L × Pa) → Geometric Mean of L and Pa
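The three indices can be computed side by side. The sketch uses hypothetical prices and quantities for three commodities and takes the Laspeyres index in its standard form, L = [Σ(P₁Q₀) / Σ(P₀Q₀)] × 100.

```python
from math import sqrt

# Hypothetical prices and quantities for three commodities
p0, q0 = [10, 8, 5], [30, 15, 20]   # base year
p1, q1 = [12, 10, 6], [25, 20, 22]  # current year

def value(prices, quantities):
    """Total value: sum of price * quantity."""
    return sum(p * q for p, q in zip(prices, quantities))

laspeyres = value(p1, q0) / value(p0, q0) * 100  # base-year quantities as weights
paasche = value(p1, q1) / value(p0, q1) * 100    # current-year quantities as weights
fisher = sqrt(laspeyres * paasche)               # geometric mean of the two

print(round(laspeyres, 2), round(paasche, 2), round(fisher, 2))
```

Fisher's index always lies between the Laspeyres and Paasche values because it is their geometric mean.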
• Unit Test: Index should be independent of units of measurement.
• Time Reversal Test: P₀₁ × P₁₀ = 1. Fisher satisfies this; Laspeyres & Paasche don't.
• Factor Reversal Test: Price index × Quantity index = Value index. Only Fisher satisfies this.
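The time reversal test can be checked numerically by computing each index forwards (base to current) and backwards (current to base) as plain ratios. The prices and quantities are hypothetical.

```python
from math import sqrt

p0, q0 = [10, 8, 5], [30, 15, 20]   # hypothetical base year
p1, q1 = [12, 10, 6], [25, 20, 22]  # hypothetical current year

def value(prices, quantities):
    return sum(p * q for p, q in zip(prices, quantities))

# All three functions share one signature; q_cur is unused by Laspeyres.
def laspeyres(p_base, q_base, p_cur, q_cur):
    return value(p_cur, q_base) / value(p_base, q_base)

def paasche(p_base, q_base, p_cur, q_cur):
    return value(p_cur, q_cur) / value(p_base, q_cur)

def fisher(p_base, q_base, p_cur, q_cur):
    return sqrt(laspeyres(p_base, q_base, p_cur, q_cur) * paasche(p_base, q_base, p_cur, q_cur))

fisher_product = fisher(p0, q0, p1, q1) * fisher(p1, q1, p0, q0)           # P01 * P10
laspeyres_product = laspeyres(p0, q0, p1, q1) * laspeyres(p1, q1, p0, q0)  # P01 * P10

print(round(fisher_product, 6), round(laspeyres_product, 6))
```

For these figures the Fisher product is 1 (up to floating-point rounding), while the Laspeyres product is 630/632, slightly below 1.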