Data Analysis and Regression
Analyze data using linear regression, correlation coefficients, and predictive modeling.
What is Data Analysis?
Data analysis: Process of examining data to find patterns and draw conclusions
Goals:
- Understand relationships between variables
- Make predictions
- Test hypotheses
- Inform decisions
Key tool: Regression analysis
Scatter Plots
Scatter plot: Graph showing relationship between two variables
x-axis: Independent variable (explanatory)
y-axis: Dependent variable (response)
Each point: One observation (x, y)
Example: Study Time vs Test Score
Data:
Study Hours (x): 1 2 3 4 5
Test Score (y): 65 70 75 85 90
Pattern: As study time increases, test score increases
Relationship: Positive correlation
Types of Correlation
Positive correlation: Both variables increase together
Negative correlation: One increases, other decreases
No correlation: No clear pattern
Strength:
- Strong: Points close to line
- Weak: Points scattered
- None: Random cloud
Example: Identify Correlation
Positive: Height vs Weight (taller → heavier)
Negative: Car age vs Value (older → less valuable)
None: Shoe size vs Test score (no relationship)
Correlation Coefficient (r)
Measures strength and direction of linear relationship
Range: -1 ≤ r ≤ 1
Interpretation:
- r = 1: Perfect positive correlation
- r = 0: No linear correlation
- r = -1: Perfect negative correlation
Strength guidelines:
- |r| ≥ 0.7: Strong
- 0.3 ≤ |r| < 0.7: Moderate
- |r| < 0.3: Weak
Example: Interpret r
r = 0.92: Strong positive correlation
r = -0.65: Moderate negative correlation
r = 0.15: Weak positive correlation
r = -0.98: Very strong negative correlation
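The guidelines above can be sketched as a small function. This is only an illustration: the name describe_correlation is made up here, and treating exactly 0.7 as strong and exactly 0.3 as moderate is an assumption, since rules of thumb like these vary between textbooks.

```python
def describe_correlation(r: float) -> str:
    """Label r by direction and strength using the rule-of-thumb cutoffs."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be between -1 and 1")
    if r == 0:
        return "no linear correlation"

    size = abs(r)
    if size >= 0.7:        # assumption: the boundary value counts as strong
        strength = "strong"
    elif size >= 0.3:      # assumption: the boundary value counts as moderate
        strength = "moderate"
    else:
        strength = "weak"

    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


for r in (0.92, -0.65, 0.15, -0.98):
    print(r, "->", describe_correlation(r))
```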
Calculating Correlation Coefficient
Formula: r = Σ[(x - x̄)(y - ȳ)] / √[Σ(x - x̄)² · Σ(y - ȳ)²]
Usually calculated with calculator or computer
Example: Calculate r
Data:
x: 1 2 3
y: 2 4 5
Step 1: Find means
- x̄ = (1+2+3)/3 = 2
- ȳ = (2+4+5)/3 = 3.67
Step 2: Calculate deviations
x - x̄: -1 0 1
y - ȳ: -1.67 0.33 1.33
Step 3: Products and squares
(x-x̄)(y-ȳ): 1.67 0 1.33 (sum = 3)
(x-x̄)²: 1 0 1 (sum = 2)
(y-ȳ)²: 2.78 0.11 1.78 (sum = 4.67)
Step 4: Calculate r
r = 3 / √(2 × 4.67)
r = 3 / √9.34
r = 3 / 3.056
r ≈ 0.98
Strong positive correlation
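The same four steps can be checked in code. A minimal sketch, assuming plain Python lists as input; the function name pearson_r is chosen here for illustration and is not part of any library.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient from the deviation formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 points")

    # Step 1: means
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)

    # Steps 2 and 3: deviations, their products, and their squares
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_xx = sum((x - x_bar) ** 2 for x in xs)
    sum_yy = sum((y - y_bar) ** 2 for y in ys)

    # Step 4: combine (undefined if either variable never changes)
    return sum_xy / sqrt(sum_xx * sum_yy)


r = pearson_r([1, 2, 3], [2, 4, 5])
print(round(r, 2))  # 0.98
```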
Linear Regression
Goal: Find line that best fits data
Line of best fit (regression line): y = mx + b
Method: Least squares (minimizes the sum of squared vertical distances from the points to the line)
Example: Find Regression Line
Data:
x: 1 2 3 4 5
y: 2 3 4 4 6
Using formulas (or calculator):
- Slope: m = 0.9
- Intercept: b = 1.1
Regression line: y = 0.9x + 1.1
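For readers who want to see what the calculator does, here is a rough sketch of the least squares formulas in deviation form: slope = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², intercept = ȳ - slope · x̄. The function name fit_line is an illustrative choice, not a library call.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 points")

    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)

    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_xx = sum((x - x_bar) ** 2 for x in xs)

    slope = sum_xy / sum_xx              # undefined if all x values are equal
    intercept = y_bar - slope * x_bar    # the line passes through (x̄, ȳ)
    return slope, intercept


m, b = fit_line([1, 2, 3, 4, 5], [2, 3, 4, 4, 6])
print(f"y = {m:.2f}x + {b:.2f}")
```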
Calculating Slope and Intercept
Slope formula: m = r · (sy / sx)
Where:
- r = correlation coefficient
- sy = standard deviation of y
- sx = standard deviation of x
Intercept formula: b = ȳ - m · x̄
Example: Calculate Regression Line
Data:
x: 2 4 6 8
y: 3 7 8 11
Given:
- x̄ = 5, ȳ = 7.25
- sx = 2.58, sy = 3.30
- r = 0.98
Calculate slope:
m = 0.98 × (3.30 / 2.58)
m = 0.98 × 1.279
m ≈ 1.25
Calculate intercept:
b = 7.25 - 1.25(5)
b = 7.25 - 6.25
b = 1.00
Regression line: y = 1.25x + 1.00
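The same calculation can be sketched with Python's standard statistics module. This assumes sx and sy are sample standard deviations (dividing by n - 1), which matches the given values. The code works from unrounded numbers, so its results can differ in the last digit from a hand calculation with rounded inputs.

```python
from statistics import mean, stdev

xs = [2, 4, 6, 8]
ys = [3, 7, 8, 11]
n = len(xs)

x_bar, y_bar = mean(xs), mean(ys)
sx, sy = stdev(xs), stdev(ys)   # sample standard deviations (n - 1)

# Correlation coefficient written in terms of sx and sy
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)

m = r * sy / sx          # slope formula
b = y_bar - m * x_bar    # intercept formula

print(f"sx = {sx:.2f}, sy = {sy:.2f}, r = {r:.2f}")
print(f"y = {m:.2f}x + {b:.2f}")
```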
Making Predictions
Use regression line to predict y for given x
Interpolation: Predict within data range (more reliable)
Extrapolation: Predict outside data range (less reliable)
Example: Prediction
Regression line: y = 0.9x + 1.1
Predict y when x = 7:
y = 0.9(7) + 1.1
y = 6.3 + 1.1
y = 7.4
Predicted value: 7.4
Example: Interpolation vs Extrapolation
Data range: x from 1 to 10
Interpolation: Predict for x = 5 (within range) ✓
Extrapolation: Predict for x = 20 (outside range) ⚠
Extrapolation risky: Relationship might change outside observed range
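One way to make the distinction concrete is to return a label with every prediction. A minimal sketch: the function name predict and the line y = 2x + 1 are placeholders chosen for illustration, and only the data range 1 to 10 comes from the example above.

```python
def predict(x, slope, intercept, x_min, x_max):
    """Return the predicted y and whether x lies inside the observed range."""
    y_hat = slope * x + intercept
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y_hat, kind


# Hypothetical line y = 2x + 1, fitted to data with x from 1 to 10
print(predict(5, 2, 1, 1, 10))    # (11, 'interpolation')
print(predict(20, 2, 1, 1, 10))   # (41, 'extrapolation')
```

A prediction labeled extrapolation is not wrong by definition, but it deserves extra caution.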
Residuals
Residual: Difference between actual y and predicted y
Formula: Residual = y_actual - y_predicted
Used to assess fit quality
Example: Calculate Residuals
Regression line: y = 2x + 1
Data point: (3, 8)
Predicted: y = 2(3) + 1 = 7
Residual: 8 - 7 = 1
Interpretation: Actual value is 1 unit above predicted
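The residual formula is a one-line calculation. A small sketch using the line and data point from this example; the function name residual is just an illustrative choice.

```python
def residual(x, y_actual, slope, intercept):
    """Actual y minus the y predicted by the line."""
    y_predicted = slope * x + intercept
    return y_actual - y_predicted


print(residual(3, 8, slope=2, intercept=1))  # 1
```

A positive residual means the point lies above the line; a negative residual means it lies below.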
Coefficient of Determination (r²)
r² = proportion of variance in y explained by x
Range: 0 ≤ r² ≤ 1
Interpretation:
- r² = 0.81: 81% of variation in y explained by x
- r² = 0.50: 50% explained
- r² = 0.10: Only 10% explained
Calculate: r² = (correlation coefficient)² (valid for regression with one independent variable)
Example: Interpret r²
r = 0.90
r² = (0.90)² = 0.81
Meaning (if x is study hours and y is test score): 81% of variation in test scores explained by study hours
Other 19%: Due to other factors (prior knowledge, sleep, etc.)
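To see why r² reads as "proportion explained", it can be computed two ways: by squaring r, and as 1 minus (unexplained variation / total variation). A sketch using the five-point data set from the regression line example earlier (x: 1 to 5, y: 2 3 4 4 6); the variable names are illustrative.

```python
from math import sqrt

xs = [1, 2, 3, 4, 5]
ys = [2, 3, 4, 4, 6]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sum_xx = sum((x - x_bar) ** 2 for x in xs)
sum_yy = sum((y - y_bar) ** 2 for y in ys)   # total variation in y

# Way 1: square the correlation coefficient
r = sum_xy / sqrt(sum_xx * sum_yy)
r2_from_r = r ** 2

# Way 2: fit the line, then compare leftover variation with total variation
m = sum_xy / sum_xx
b = y_bar - m * x_bar
unexplained = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
r2_from_fit = 1 - unexplained / sum_yy

print(f"{r2_from_r:.3f} {r2_from_fit:.3f}")  # 0.920 0.920
```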
Outliers
Outlier: Point far from other data points
Effect: Can greatly influence regression line
Should investigate: Might be error or genuine unusual case
Example: Effect of Outlier
Data (without outlier): r = 0.92
Add outlier: (100, 10)
New r: Might drop to 0.70
Always check for outliers before concluding
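A quick experiment shows how much a single point can matter. This sketch reuses the study hours data from the scatter plot example and adds one made-up outlier, (10, 40). The outlier is hypothetical and exists only to illustrate the effect.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient from the deviation formula."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_xx = sum((x - x_bar) ** 2 for x in xs)
    sum_yy = sum((y - y_bar) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)


hours = [1, 2, 3, 4, 5]
scores = [65, 70, 75, 85, 90]

r_without = pearson_r(hours, scores)
# Hypothetical outlier: 10 hours of study but a score of 40
r_with = pearson_r(hours + [10], scores + [40])

print(round(r_without, 2))  # 0.99
print(round(r_with, 2))     # -0.53
```

Here one unusual point out of six is enough to flip the sign of r.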
Causation vs Correlation
IMPORTANT: Correlation ≠ Causation
Strong correlation doesn't prove one causes the other
Possible explanations:
- X causes Y
- Y causes X
- Third variable causes both
- Coincidence
Example: Ice Cream and Drowning
Observation: Ice cream sales correlate with drowning deaths
Does ice cream cause drowning? NO
Real cause: Temperature (third variable)
- Hot weather → more ice cream sales
- Hot weather → more swimming → more drownings
Multiple Regression
Multiple regression: More than one independent variable
Formula: y = b₀ + b₁x₁ + b₂x₂ + ... + bₙxₙ
Example: Predict house price using size, bedrooms, age
More complex, but often gives more accurate predictions when the added variables are relevant
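Once the coefficients are known, evaluating the formula is a weighted sum. In the sketch below every coefficient is invented purely for illustration (a real model would estimate them from data), and predict_multiple is a made-up name.

```python
def predict_multiple(intercept, coefficients, features):
    """Evaluate y = b0 + b1*x1 + b2*x2 + ... + bn*xn."""
    if len(coefficients) != len(features):
        raise ValueError("need one coefficient per feature")
    return intercept + sum(b * x for b, x in zip(coefficients, features))


# Hypothetical coefficients: price per square foot, per bedroom, per year of age
b0 = 50_000
coefs = [100, 10_000, -500]
house = [1500, 3, 20]   # 1500 sq ft, 3 bedrooms, 20 years old

print(predict_multiple(b0, coefs, house))  # 220000
```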
Real-World Applications
Business: Sales forecasting, market analysis
Medicine: Disease prediction, treatment effectiveness
Economics: Economic indicators, policy impact
Sports: Player performance, game outcomes
Weather: Temperature prediction, climate modeling
Example: Housing Prices
Independent variables:
- Square footage
- Number of bedrooms
- Age of house
- Distance to city center
Dependent variable:
- Sale price
Use regression to predict price for new house
Assumptions of Linear Regression
For valid results:
- Linear relationship: Data follows roughly straight line
- Independence: Observations independent of each other
- Homoscedasticity: Constant variance in residuals
- Normality: Residuals normally distributed
Check assumptions before trusting results
Technology Tools
Calculator: Can find regression line automatically
Spreadsheets: Excel, Google Sheets
Statistical software: R, Python (pandas, scikit-learn)
Online tools: Many free regression calculators
Practice
Correlation coefficient r = -0.85 indicates:
If r = 0.70, what is r² (coefficient of determination)?
Strong correlation between A and B proves:
Regression line: y = 3x + 2. When x = 5, predicted y = ?