Data Analysis and Regression

Analyze data using linear regression, correlation coefficients, and predictive modeling.

Tags: advanced, statistics, regression, correlation, data-analysis, high-school · Updated 2026-02-02

What is Data Analysis?

Data analysis: Process of examining data to find patterns and draw conclusions

Goals:

  • Understand relationships between variables
  • Make predictions
  • Test hypotheses
  • Inform decisions

Key tool: Regression analysis

Scatter Plots

Scatter plot: Graph showing relationship between two variables

x-axis: Independent variable (explanatory)
y-axis: Dependent variable (response)

Each point: One observation (x, y)

Example: Study Time vs Test Score

Data:

Study Hours (x):  1    2    3    4    5
Test Score (y):   65   70   75   85   90

Pattern: As study time increases, test score increases

Relationship: Positive correlation

Types of Correlation

Positive correlation: Both variables increase together

Negative correlation: One increases, other decreases

No correlation: No clear pattern

Strength:

  • Strong: Points close to line
  • Weak: Points scattered
  • None: Random cloud

Example: Identify Correlation

Positive: Height vs Weight (taller → heavier)

Negative: Car age vs Value (older → less valuable)

None: Shoe size vs Test score (no relationship)

Correlation Coefficient (r)

Measures strength and direction of linear relationship

Range: -1 ≤ r ≤ 1

Interpretation:

  • r = 1: Perfect positive correlation
  • r = 0: No linear correlation
  • r = -1: Perfect negative correlation

Strength guidelines:

  • |r| ≥ 0.7: Strong
  • 0.3 ≤ |r| < 0.7: Moderate
  • |r| < 0.3: Weak
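The strength guidelines can be expressed as a small helper function. This is a hypothetical classifier written for illustration, not a standard library routine:

```python
def correlation_strength(r: float) -> str:
    """Classify |r| using the strength guidelines above."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    magnitude = abs(r)
    if magnitude >= 0.7:
        return "strong"
    if magnitude >= 0.3:
        return "moderate"
    return "weak"

print(correlation_strength(0.92))   # → strong
print(correlation_strength(-0.65))  # → moderate
print(correlation_strength(0.15))   # → weak
```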

Example: Interpret r

r = 0.92: Strong positive correlation

r = -0.65: Moderate negative correlation

r = 0.15: Weak positive correlation

r = -0.98: Very strong negative correlation

Calculating Correlation Coefficient

Formula: r = Σ[(x - x̄)(y - ȳ)] / √[Σ(x - x̄)² · Σ(y - ȳ)²]

Usually calculated with calculator or computer

Example: Calculate r

Data:

x:  1   2   3
y:  2   4   5

Step 1: Find means

  • x̄ = (1+2+3)/3 = 2
  • ȳ = (2+4+5)/3 = 3.67

Step 2: Calculate deviations

x - x̄:  -1   0   1
y - ȳ:  -1.67  0.33  1.33

Step 3: Products and squares

(x-x̄)(y-ȳ):  1.67   0   1.33  (sum = 3)
(x-x̄)²:       1      0   1     (sum = 2)
(y-ȳ)²:       2.79   0.11  1.77 (sum = 4.67)

Step 4: Calculate r

r = 3 / √(2 × 4.67)
r = 3 / √9.34
r = 3 / 3.056
r ≈ 0.98

Strong positive correlation
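The four steps above translate directly into code; a short sketch that reproduces r ≈ 0.98 for this data:

```python
from math import sqrt

x = [1, 2, 3]
y = [2, 4, 5]

# Step 1: means
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# Steps 2-3: sum of deviation products and sums of squared deviations
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

# Step 4: r = Sxy / sqrt(Sxx * Syy)
r = sxy / sqrt(sxx * syy)
print(round(r, 2))  # → 0.98
```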

Linear Regression

Goal: Find line that best fits data

Line of best fit (regression line): y = mx + b

Method: Least squares (minimizes the sum of squared vertical distances from the points to the line)

Example: Find Regression Line

Data:

x:  1   2   3   4   5
y:  2   3   4   4   6

Using formulas (or calculator):

  • Slope: m = 0.9
  • Intercept: b = ȳ - m·x̄ = 3.8 - 0.9(3) = 1.1

Regression line: y = 0.9x + 1.1

Calculating Slope and Intercept

Slope formula: m = r · (sy / sx)

Where:

  • r = correlation coefficient
  • sy = standard deviation of y
  • sx = standard deviation of x

Intercept formula: b = ȳ - m · x̄

Example: Calculate Regression Line

Data:

x:  2   4   6   8
y:  3   7   8  11

Given:

  • x̄ = 5, ȳ = 7.25
  • sx = 2.58, sy = 3.30
  • r = 0.97

Calculate slope:

m = 0.97 × (3.30 / 2.58)
m = 0.97 × 1.279
m ≈ 1.24

Calculate intercept:

b = 7.25 - 1.24(5)
b = 7.25 - 6.20
b = 1.05

Regression line: y = 1.24x + 1.05

Making Predictions

Use regression line to predict y for given x

Interpolation: Predict within data range (more reliable)

Extrapolation: Predict outside data range (less reliable)

Example: Prediction

Regression line: y = 0.9x + 1.1

Predict y when x = 7:

y = 0.9(7) + 1.1
y = 6.3 + 1.1
y = 7.4

Predicted value: 7.4

Example: Interpolation vs Extrapolation

Data range: x from 1 to 10

Interpolation: Predict for x = 5 (within range) ✓

Extrapolation: Predict for x = 20 (outside range) ⚠

Extrapolation risky: Relationship might change outside observed range
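One way to make the distinction concrete is a prediction helper that flags extrapolation. The function, the data range, and the fitted line y = 2x + 3 below are all hypothetical:

```python
def predict(x: float, m: float, b: float,
            x_min: float, x_max: float) -> float:
    """Predict y = m*x + b, warning when x falls outside
    the observed range [x_min, x_max] (extrapolation)."""
    if not x_min <= x <= x_max:
        print(f"warning: x={x} is outside [{x_min}, {x_max}]; "
              "extrapolation is less reliable")
    return m * x + b

# Hypothetical line y = 2x + 3 fitted on data with x from 1 to 10
print(predict(5, 2, 3, 1, 10))   # interpolation → 13
print(predict(20, 2, 3, 1, 10))  # extrapolation, prints a warning → 43
```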

Residuals

Residual: Difference between actual y and predicted y

Formula: Residual = y_actual - y_predicted

Used to assess fit quality

Example: Calculate Residuals

Regression line: y = 2x + 1

Data point: (3, 8)

Predicted: y = 2(3) + 1 = 7

Residual: 8 - 7 = 1

Interpretation: Actual value is 1 unit above predicted
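Residuals are usually computed for every data point at once. A short sketch using the line y = 2x + 1 from the example; the point (3, 8) is from the text, the other two points are illustrative:

```python
# Regression line from the example above
def predicted(x):
    return 2 * x + 1

# (x, y_actual) observations; (3, 8) matches the worked example
points = [(1, 3.5), (2, 4.5), (3, 8)]

# Residual = y_actual - y_predicted for each point
residuals = [y - predicted(x) for x, y in points]
print(residuals)  # → [0.5, -0.5, 1]
```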

Coefficient of Determination (r²)

r²: Proportion of variance in y explained by x

Range: 0 ≤ r² ≤ 1

Interpretation:

  • r² = 0.81: 81% of variation in y explained by x
  • r² = 0.50: 50% explained
  • r² = 0.10: Only 10% explained

Calculate: r² = (correlation coefficient)²

Example: Interpret r²

r = 0.90

r² = (0.90)² = 0.81

Meaning: 81% of variation in test scores explained by study hours

Other 19%: Due to other factors (prior knowledge, sleep, etc.)

Outliers

Outlier: Point far from other data points

Effect: Can greatly influence regression line

Should investigate: Might be error or genuine unusual case

Example: Effect of Outlier

Data (without outlier): r = 0.92

Add outlier: (100, 10)

New r: Might drop to 0.70

Always check for outliers before concluding

Causation vs Correlation

IMPORTANT: Correlation ≠ Causation

Strong correlation doesn't prove one causes the other

Possible explanations:

  1. X causes Y
  2. Y causes X
  3. Third variable causes both
  4. Coincidence

Example: Ice Cream and Drowning

Observation: Ice cream sales correlate with drowning deaths

Does ice cream cause drowning? NO

Real cause: Temperature (third variable)

  • Hot weather → more ice cream sales
  • Hot weather → more swimming → more drownings

Multiple Regression

Multiple regression: More than one independent variable

Formula: y = b₀ + b₁x₁ + b₂x₂ + ... + bₙxₙ

Example: Predict house price using size, bedrooms, age

More complex but more accurate predictions
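The multi-variable formula can be fit by least squares on a design matrix. A sketch assuming NumPy is available; the data is constructed so that price = 10 + 0.5·size − 2·age exactly (made-up units, for illustration only):

```python
import numpy as np

# Hypothetical housing data: price from size and age,
# built so that price = 10 + 0.5*size - 2*age holds exactly
size = np.array([10.0, 15.0, 20.0, 25.0])
age = np.array([3.0, 5.0, 2.0, 1.0])
price = 10 + 0.5 * size - 2 * age

# Design matrix: a column of ones (for b0), then one column per predictor
X = np.column_stack([np.ones_like(size), size, age])

# Least-squares solution for [b0, b1, b2]
coeffs, *_ = np.linalg.lstsq(X, price, rcond=None)
b0, b1, b2 = coeffs
print(np.round(coeffs, 2))  # recovers b0 = 10, b1 = 0.5, b2 = -2
```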

Real-World Applications

Business: Sales forecasting, market analysis

Medicine: Disease prediction, treatment effectiveness

Economics: Economic indicators, policy impact

Sports: Player performance, game outcomes

Weather: Temperature prediction, climate modeling

Example: Housing Prices

Independent variables:

  • Square footage
  • Number of bedrooms
  • Age of house
  • Distance to city center

Dependent variable:

  • Sale price

Use regression to predict price for new house

Assumptions of Linear Regression

For valid results:

  1. Linear relationship: Data follows roughly straight line
  2. Independence: Observations independent of each other
  3. Homoscedasticity: Constant variance in residuals
  4. Normality: Residuals normally distributed

Check assumptions before trusting results

Technology Tools

Calculator: Can find regression line automatically

Spreadsheets: Excel, Google Sheets

Statistical software: R, Python (pandas, scikit-learn)

Online tools: Many free regression calculators

Practice

  1. What strength and direction does a correlation coefficient of r = -0.85 indicate?
  2. If r = 0.70, what is r² (the coefficient of determination)?
  3. Does a strong correlation between A and B prove that A causes B?
  4. Regression line: y = 3x + 2. When x = 5, predicted y = ?