
Forecast Friday Topic: Slope Dummy Variables

October 7, 2010

(Twenty-fourth in a series)

In the last two posts, we discussed the use of dummy variables for factoring the impact of categorical or seasonal phenomena into regression models. Those dummy variables affected the y-intercept of the regression equation. However, many datasets – especially time series – are subject to structural changes that affect the slope – the coefficient – in the regression equation. For example, if you were doing long range forecasting based on several years of data for the airline industry, airline business practices were very different before September 11, 2001 – and you must adjust for it.

Structural changes can also occur in cross-sectional data. If you are an operations manager at a factory and are trying to develop a model for worker productivity based on years of experience and education, you might discover that education requirements for factory jobs changed some time ago. Of course, not all current factory workers were affected by the change; some older workers were grandfathered, or union contracts may have shielded them from the changes. If, for example, the newer factory workers were required to obtain a certain amount of college-level training for their work, and you don’t account for the changed requirement, your parameter estimates will be biased.

How do we account for these structural shifts? Slope dummy variables – or slope dummies, for short.

Since the specification of a slope dummy is only slightly more complex than an intercept dummy, I will not be using a full-scale regression example here as I have in past posts. Rather, I will show what a regression model with a slope dummy looks like.

Let’s assume you run a business and your sales are greatly affected by city ordinances – the more ordinances there are, the lower your sales. Your city has two political parties – the Regulation party and the Deregulation party. For the most part, the Regulation party tends to impose more ordinances when they occupy city hall and the Deregulation party tends to impose fewer, or rescind, ordinances when they’re in office.

Of course, sometimes a Regulation administration may not impose new ordinances; and a Deregulation administration may impose them, depending on the policy and economic issues the city is facing at the time. But, for the most part, ordinances tend to increase under Regulation administrations. So how do we account for this?

Let’s start with a simple regression equation:

Ŷ = α – β1Ot + εt

In this equation, Ŷ represents forecasted sales; α is the y-intercept; β1 is the parameter estimate for variable O, which is the number of pages of ordinances on the city’s books in that year; ε is the error term; and t is the time period in the regression. Notice that the ordinance term enters the equation with a negative sign, which is what we would expect: as the number of pages of ordinances increases, we expect sales to go down.

But now you want to account for whether the party in office is a Regulation administration. So you create a dummy variable called Dt. Dt=1 in years when city hall is run by a Regulation mayor and Dt=0 when the city is run by a Deregulation mayor. So your new equation looks like this:

Ŷ = α – β1Ot – β2OtDt + εt

Notice the difference in this last equation? The dummy is multiplied by the pages of ordinances, and that product is entered as its own independent variable in the model. Hence, we forecast our sales as follows:

When a Deregulation administration is in city hall (Dt = 0): Ŷ = α – β1Ot

When a Regulation administration is in city hall (Dt = 1): Ŷ = α – (β1 + β2)Ot

Hence, you see the slope (parameter estimate) is different if the Regulation party is in office.
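To see how a slope dummy is specified in practice, here is a minimal sketch in Python (statsmodels) using made-up data; the variable names sales, ordinance_pages, and regulation_party are hypothetical stand-ins for the example above:

# A minimal sketch of fitting a regression with a slope dummy.
# All data below are simulated for illustration only.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 40
ordinance_pages = rng.uniform(50, 200, n)        # O_t: pages of ordinances on the books
regulation_party = rng.integers(0, 2, n)         # D_t: 1 = Regulation mayor, 0 = Deregulation
# Simulated sales with a steeper negative slope when D_t = 1
sales = (500 - 1.2 * ordinance_pages
         - 0.8 * ordinance_pages * regulation_party
         + rng.normal(0, 20, n))

X = pd.DataFrame({
    "ordinance_pages": ordinance_pages,
    "pages_x_regulation": ordinance_pages * regulation_party,  # slope dummy term O_t * D_t
})
X = sm.add_constant(X)

model = sm.OLS(sales, X).fit()
print(model.params)   # slope on O_t, plus the additional slope when Regulation is in office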

Next Forecast Friday Topic: Selecting Variables for a Regression Model

Sometimes you have a lot of variables to choose from when building a regression model. How do you know which ones to include in your model? We will discuss some approaches to determine which variables to enter into your model next week.

************************

Check out Analysights’ profile at the Janlong Communications Blog!

Marketing communications specialist Janice Long of Janlong Communications profiles many small and up-and-coming businesses on her blog, asking owners their ideal client profile and how they got started. This week, Janlong Communications profiled Analysights! See the brief post about how we got started and the market niche we serve.


Forecast Friday Topic: Multicollinearity – How to Detect it; How to Correct it

July 15, 2010

(Thirteenth in a series)

In last week’s Forecast Friday post, we explored how to perform regression analysis using Excel. We looked at the giving history of 20 contributors to a nonprofit organization, and developed a model based on the recency, frequency, and monetary value (RFM) of their past donations. We derived the following regression equation:

Ŷ = 87.27 – 1.80X1 + 2.45X2 + 0.35X3

We were pleased to see that our model had a coefficient of determination of R2 = 0.933, indicating that it explained 93.3% of the variation in the donors’ current contributions (our Ŷ). But we were a little disheartened when we looked at the t-statistics of our regression coefficients. Recall that we found our recency coefficient was not significant:

Parameter | Coefficient | T-statistic | Significant?
Intercept | 87.27 | 4.32 | Yes
Months since Last | (1.80) | (1.44) | No
Times Donated | 2.45 | 2.87 | Yes
Average Contribution | 0.35 | 3.26 | Yes

Yet most direct marketing professionals know that RFM theory postulates all three variables as significant indicators of whether, and how much, a donor will give (or a customer will buy). When our model doesn’t replicate what a tried-and-true theory has long maintained, something may well be wrong.

Multicollinearity

Most times, when something doesn’t look right in the results of a regression model, it is safe to assume that one of the regression assumptions has been violated. The problem is trying to determine which assumption – or assumptions – was violated. Since the coefficient for “Months Since Last Contribution” has a t-statistic that indicates it isn’t statistically significant, we might suspect that the specification assumption is violated: that is, we may believe that “Months Since Last Contribution” is an extraneous, irrelevant variable that should not have been included in the model and, thus, be removed.

But is that really the case? There can be other reasons why a parameter estimate does not come up significant. If two or more independent variables are highly correlated, the resulting multicollinearity can cause the regression model to assign a statistically insignificant parameter estimate to an important independent variable. So, how can we detect multicollinearity?

Detecting Multicollinearity: Correlation Matrix

The first step in detecting multicollinearity is to examine the correlation among the independent variables. We do this by looking at a correlation matrix. You can run a correlation matrix in Excel by using its Data Analysis ToolPak. Looking at the correlation matrix for our variables, we find:

Correlation Matrix – Original Variables

Variable | Contribution (Y) | Months Since Last Donation (X1) | Times Donated in Last 12 Months (X2) | Average Contribution in Last 12 Months (X3)
Contribution (Y) | 1.00 | | |
Months Since Last Donation (X1) | -0.93 | 1.00 | |
Times Donated in Last 12 Months (X2) | 0.89 | -0.88 | 1.00 |
Average Contribution Last 12 mo. (X3) | 0.88 | -0.84 | 0.69 | 1.00

A correlation of 1.00 means two variables are perfectly correlated; a correlation of 0.00 means there is no correlation at all. The cells in the matrix above where the correlation is 1.00 show the correlation of a variable with itself – we would expect a perfect relationship there. What matters most to us are the numbers below the 1.00 correlations. The first column shows our dependent variable, “Contribution”. As you go down that column, row by row, you see that each of our independent variables is strongly correlated with the dependent variable, indicating that they are all strong predictors.

The correlation between “Months Since Last Donation” (X1) and the donor’s Contribution (Y) is almost perfectly negative (-0.93), while the correlations of the other two independent variables with the contribution are strongly positive (0.89 and 0.88). When writing these in shorthand, we use the Greek letter rho, ρ, to denote correlation. Hence, to show the correlation between each independent variable and the dependent variable, we would express them as follows:

ρX1Y = -0.93

ρX2Y = 0.89

ρX3Y = 0.88

But now, let’s look at the correlations among our independent variables:

ρX1X2= -0.88

ρX1X3= -0.84

ρX2X3= 0.69

 

Notice that all of our independent variables are highly correlated with one another. The relationship between “Times Donated in Last 12 Months” and “Average Contribution in Last 12 Months” is not as strong as the correlation of each of those variables with “Months Since Last Donation,” but it is still very strong.

Hence, we can conclude that multicollinearity is present in this model.
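If you would rather compute the correlation matrix outside Excel, here is a minimal sketch using pandas. The donor data below are simulated stand-ins for the 20 contributors, and the column names are assumptions for illustration only:

# A minimal sketch of building a correlation matrix with pandas.
# The donor data and column names are hypothetical; substitute your own file.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
months_since_last = rng.integers(1, 24, 20)
times_donated = np.maximum(1, 12 - months_since_last // 2 + rng.integers(-2, 3, 20))
avg_contribution = 20 + 5 * times_donated + rng.normal(0, 10, 20)
contribution = (90 - 1.8 * months_since_last + 2.5 * times_donated
                + 0.35 * avg_contribution + rng.normal(0, 10, 20))

donors = pd.DataFrame({
    "contribution": contribution,
    "months_since_last": months_since_last,
    "times_donated": times_donated,
    "avg_contribution": avg_contribution,
})

# Pairwise Pearson correlations among all variables
print(donors.corr().round(2))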

Correcting Multicollinearity: Dropping Variables

In today’s post, we will discuss one of the remedies for multicollinearity – dropping a highly correlated independent variable. Next week, we’ll discuss the other approaches to correcting multicollinearity. Sometimes, when a variable is “iffy,” we can save ourselves some trouble and just kick it out. If we were to ignore “Months Since Last Donation,” and run our regression with the remaining two variables, we end up with the following regression equation:

Ŷ = 60.68 + 3.37X2 + 0.45X3

We get R2 = 0.924, suggesting that we didn’t lose much explanatory power by excluding “Months Since Last Donation.” We also get an F-statistic of 103.36, much higher than the 73.90 we had in our original model. A higher F-statistic indicates a model that is more statistically valid; it also reflects the exclusion of one or more extraneous variables. Also, the t-statistics for both independent variables are significant, and they’re even higher than they were in the original model, further indicating increased validity:

Parameter | Coefficient | T-statistic | Significant?
Intercept | 60.68 | 7.24 | Yes
Times Donated | 3.37 | 5.83 | Yes
Average Contribution | 0.45 | 5.49 | Yes

Dropping “Months Since Last Donation” from our analysis worked here. However, dropping variables without a rational decision process can cause new problems. In some cases, dropping a variable can result in specification bias, as we saw in our previous example of predicting profit margin for savings and loan associations a few weeks ago. So, consider dropping variables cautiously.
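Here is a minimal sketch of the drop-and-refit step in Python (statsmodels). The donors DataFrame is simulated and the column names are hypothetical; with the real data you would simply compare the R2, F-statistic, and t-statistics of the full and reduced models, as we did above:

# A minimal sketch of refitting after dropping a collinear variable.
# Data and column names are simulated assumptions for illustration.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
months_since_last = rng.integers(1, 24, 20)
times_donated = np.maximum(1, 12 - months_since_last // 2 + rng.integers(-2, 3, 20))
avg_contribution = 20 + 5 * times_donated + rng.normal(0, 10, 20)
contribution = (90 - 1.8 * months_since_last + 2.5 * times_donated
                + 0.35 * avg_contribution + rng.normal(0, 10, 20))
donors = pd.DataFrame({"contribution": contribution,
                       "months_since_last": months_since_last,
                       "times_donated": times_donated,
                       "avg_contribution": avg_contribution})

full = sm.OLS(donors["contribution"],
              sm.add_constant(donors[["months_since_last", "times_donated",
                                      "avg_contribution"]])).fit()
reduced = sm.OLS(donors["contribution"],
                 sm.add_constant(donors[["times_donated", "avg_contribution"]])).fit()

# Compare fit and coefficient significance before and after the drop
print(full.rsquared, reduced.rsquared)
print(full.fvalue, reduced.fvalue)
print(reduced.tvalues)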

Next Forecast Friday Topic: More Multicollinearity Remedies

Today, we described one of the ways to remedy multicollinearity – dropping variables. Next week, we will explore two other ways of correcting multicollinearity: obtaining more data and transforming variables. We will also discuss the pitfalls of all three of these remedies, and we will discuss when it’s not worth it to reduce the impact of multicollinearity.

*************************************

Let Analysights Take the Pain out of Forecasting!

Multicollinearity is but one of the many problems you can encounter when forecasting. Let Analysights walk you through the forecasting process so that you can spend more time making strategic decisions and less time trying to guess first where business is going. We will make your forecasting efforts seamless, so you can concentrate on running your business. Check out our Web site or call (847) 895-2565.

Forecast Friday Topic: Multiple Regression Analysis (continued)

June 24, 2010

(Tenth in a series)

Today we resume our discussion of multiple regression analysis. Last week, we built a model to determine the extent of any relationship between U.S. savings & loan associations’ percent profit margin and two independent variables, net revenues per deposit dollar and number of S&L offices. Today, we will compute the 95% confidence interval for each parameter estimate; determine whether the model is valid; check for autocorrelation; and use the model to forecast. Recall that our resulting model was:

Yt = 1.56450 + 0.23720X1t – 0.000249X2t

Where Yt is the percent profit margin for the S&L in Year t; X1t is the net revenues per deposit dollar in Year t; and X2t is the number of S&L offices in the U.S. in Year t. Recall that the R2 is .865, indicating that 86.5% of the change in percentage profit margin is explained by changes in net revenues per deposit dollar and number of S&L offices.

Determining the 95% Confidence Interval for the Partial Slope Coefficients

In multiple regression analysis, there are multiple independent variables, and the parameter estimates for each of them jointly determine the slope of the regression line; hence the coefficients β1 and β2 are referred to as partial slope estimates. As with simple linear regression, we need to determine the 95% confidence interval for each parameter estimate, so that we can get an idea of where the true population parameter lies. Recall from our June 3 post that we did that by determining the equation for the standard error of the estimate, sε, and then the standard error of the regression slope, sb. That worked well for simple regression, but for multiple regression it is more complicated: deriving the standard errors of the partial regression coefficients requires linear algebra and would be too complicated to discuss here. Several statistical programs, as well as Excel, compute these values for us. So we will simply state the values of sb1 and sb2 and go from there.

sb1 = 0.05556

sb2 = 0.00003

Also, we need our critical-t value for 22 degrees of freedom, which is 2.074.

Hence, our 95% confidence interval for β1 is denoted as:

0.23720 ± 2.074 × 0.05556

=0.12197 to 0.35243

Hence, we are saying that we can be 95% confident that the true parameter β1 lies somewhere between the values of 0.12197 and 0.35243.

For β2, the procedure is similar:

-0.000249 ± 2.074 × 0.00003

=-0.00032 to -0.00018

Hence, we can be 95% confident that the true parameter β2 lies somewhere between the values of -0.00032 and -0.00018. Also, the confidence interval for the intercept, α, ranges from 1.40 to 1.73.

Note that in all of these cases, the confidence interval does not contain a value of zero within its range. The confidence intervals for α and β1 are positive; that for β2 is negative. If any parameter’s confidence interval ranges crossed zero, then the parameter estimate would not be significant.
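As a quick check of the arithmetic, here is a minimal sketch of these confidence-interval calculations in Python, using the coefficient and standard-error values reported above:

# A minimal sketch of the 95% confidence interval arithmetic shown above.
from scipy import stats

t_crit = stats.t.ppf(0.975, df=22)   # two-tailed 95%, 22 degrees of freedom (about 2.074)

b1, sb1 = 0.23720, 0.05556
b2, sb2 = -0.000249, 0.00003

print((b1 - t_crit * sb1, b1 + t_crit * sb1))   # roughly 0.12 to 0.35
print((b2 - t_crit * sb2, b2 + t_crit * sb2))   # negative throughout; small rounding
                                                # differences vs. the figures above are expected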

Is Our Model Valid?

The next thing we want to do is determine if our model is valid. When validating our model we are trying to prove that our independent variables explain the variation in the dependent variable. So we start with a hypothesis test:

H0: β1 = β2 = 0

HA: at least one β ≠ 0

Our null hypothesis says that our independent variables, net revenue per deposit dollar and number of S&L offices, explain nothing of the variation in an S&L’s percentage profit margin, and hence, that our model is not valid. Our alternative hypothesis says that at least one of our independent variables explains some of the variation in an S&L’s percentage profit margin, and thus that the model is valid.

So how do we do it? Enter the F-test. Like the t-test, the F-test is a means of hypothesis testing. Let’s start by calculating the F-statistic for our model, using the following equation:

F = (RSS / k) / (ESS / (n – k – 1))

Remember that RSS is the regression sum of squares and ESS is the error sum of squares. The May 27th Forecast Friday post showed you how to calculate RSS and ESS. For this model, RSS = 0.4015 and ESS = 0.0625; k is the number of independent variables, and n is the sample size. Our equation reduces to:

F = (0.4015 / 2) / (0.0625 / 22)

= 70.66

If our Fcalc is greater than the critical F value for the distribution, then we can reject our null hypothesis and conclude that there is strong evidence that at least one of our independent variables explains some of the variation in an S&L’s percentage profit margin. How do we determine our critical F? There is yet another table in any statistics book or statistics Web site, called the “F Distribution” table. In it, you look up two sets of degrees of freedom – one for the numerator and one for the denominator of your Fcalc equation. In the numerator, we have two degrees of freedom; in the denominator, 22. So we look at the F Distribution table and notice that the columns represent numerator degrees of freedom and the rows denominator degrees of freedom. When we find column (2), row (22), we end up with an F-value of 5.72.

Our Fcalc is greater than that, so we can conclude that our model is valid.
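Here is a minimal sketch of the same F-test in Python, using the RSS and ESS reported above; the 5.72 quoted above matches the critical F at the 1% significance level for (2, 22) degrees of freedom:

# A minimal sketch of the F-test above, using the RSS and ESS from the post.
from scipy import stats

RSS, ESS = 0.4015, 0.0625
k, n = 2, 25                      # independent variables, observations

F_calc = (RSS / k) / (ESS / (n - k - 1))
print(F_calc)                     # about 70.7

# Critical F for (2, 22) degrees of freedom
print(stats.f.ppf(0.99, dfn=2, dfd=22))   # about 5.72 at the 1% level
print(stats.f.ppf(0.95, dfn=2, dfd=22))   # about 3.44 at the 5% level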

Is Our Model Free of Autocorrelation?

Recall from our assumptions that none of our error terms should be correlated with one another. If they are, autocorrelation results, rendering our parameter estimates inefficient. To check for autocorrelation, we need to look at our error terms, which we get by comparing our predicted percentage profit margin, Ŷ, with the actual, Y:

Year | Actual Percentage Profit Margin (Yt) | Predicted by Model (Ŷt) | Error
1 | 0.75 | 0.68 | (0.0735)
2 | 0.71 | 0.71 | 0.0033
3 | 0.66 | 0.70 | 0.0391
4 | 0.61 | 0.67 | 0.0622
5 | 0.70 | 0.68 | (0.0162)
6 | 0.72 | 0.71 | (0.0124)
7 | 0.77 | 0.74 | (0.0302)
8 | 0.74 | 0.76 | 0.0186
9 | 0.90 | 0.79 | (0.1057)
10 | 0.82 | 0.79 | (0.0264)
11 | 0.75 | 0.80 | 0.0484
12 | 0.77 | 0.83 | 0.0573
13 | 0.78 | 0.80 | 0.0222
14 | 0.84 | 0.80 | (0.0408)
15 | 0.79 | 0.75 | (0.0356)
16 | 0.70 | 0.73 | 0.0340
17 | 0.68 | 0.70 | 0.0249
18 | 0.72 | 0.69 | (0.0270)
19 | 0.55 | 0.64 | 0.0851
20 | 0.63 | 0.61 | (0.0173)
21 | 0.56 | 0.57 | 0.0101
22 | 0.41 | 0.48 | 0.0696
23 | 0.51 | 0.44 | (0.0725)
24 | 0.47 | 0.40 | (0.0746)
25 | 0.32 | 0.38 | 0.0574

The next thing we need to do is subtract the previous period’s error from the current period’s error. After that, we square our result. Note that we will only have 24 observations (we can’t subtract anything from the first observation):

Year | Error | Difference in Errors | Squared Difference in Errors
1 | (0.07347) | |
2 | 0.00334 | 0.07681 | 0.00590
3 | 0.03910 | 0.03576 | 0.00128
4 | 0.06218 | 0.02308 | 0.00053
5 | (0.01624) | (0.07842) | 0.00615
6 | (0.01242) | 0.00382 | 0.00001
7 | (0.03024) | (0.01781) | 0.00032
8 | 0.01860 | 0.04883 | 0.00238
9 | (0.10569) | (0.12429) | 0.01545
10 | (0.02644) | 0.07925 | 0.00628
11 | 0.04843 | 0.07487 | 0.00561
12 | 0.05728 | 0.00884 | 0.00008
13 | 0.02217 | (0.03511) | 0.00123
14 | (0.04075) | (0.06292) | 0.00396
15 | (0.03557) | 0.00519 | 0.00003
16 | 0.03397 | 0.06954 | 0.00484
17 | 0.02489 | (0.00909) | 0.00008
18 | (0.02697) | (0.05185) | 0.00269
19 | 0.08509 | 0.11206 | 0.01256
20 | (0.01728) | (0.10237) | 0.01048
21 | 0.01012 | 0.02740 | 0.00075
22 | 0.06964 | 0.05952 | 0.00354
23 | (0.07252) | (0.14216) | 0.02021
24 | (0.07460) | (0.00208) | 0.00000
25 | 0.05738 | 0.13198 | 0.01742

If we sum up the last column, we get 0.1218; if we then divide that by our ESS of 0.0625, we get a value of 1.95. What does this mean?

We have just computed what is known as the Durbin-Watson Statistic, which is used to detect the presence of autocorrelation. The Durbin-Watson statistic, d, can be anywhere from zero to 4. Generally, when d is close to zero, it suggests the presence of positive autocorrelation; a value close to 2 indicates no autocorrelation; while a value close to 4 indicates negative autocorrelation. In any case, you want your Durbin-Watson statistic to be as close to two as possible, and ours is.

Hence, our model seems to be free of autocorrelation.
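Here is a minimal sketch of that calculation in Python, using the signed model errors from the table above (values in parentheses are negative):

# A minimal sketch of the Durbin-Watson calculation above.
import numpy as np

errors = np.array([-0.07347, 0.00334, 0.03910, 0.06218, -0.01624, -0.01242,
                   -0.03024, 0.01860, -0.10569, -0.02644, 0.04843, 0.05728,
                   0.02217, -0.04075, -0.03557, 0.03397, 0.02489, -0.02697,
                   0.08509, -0.01728, 0.01012, 0.06964, -0.07252, -0.07460,
                   0.05738])

# d = sum of squared differences of successive errors, divided by the sum of squared errors
d = np.sum(np.diff(errors) ** 2) / np.sum(errors ** 2)
print(round(d, 2))    # about 1.95 -- close to 2, so little sign of autocorrelation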

Now, Let’s Go Forecast!

Now that we have validated our model and seen that it is free of autocorrelation, we can be comfortable forecasting. Let’s say that for years 26 and 27, we have the following forecasts for net revenues per deposit dollar (X1t) and number of S&L offices (X2t):

X1,26 = 4.70 and X2,26 = 9,350

X1,27 = 4.80 and X2,27 = 9,400

Plugging each of these into our equations, we generate the following forecasts:

Ŷ26 = 1.56450 + 0.23720 * 4.70 – 0.000249 * 9,350

=0.3504

Ŷ27 = 1.56450 + 0.23720 * 4.80 – 0.000249 * 9,400

=0.3617
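Here is a minimal sketch of these plug-in forecasts in Python; small differences from the figures above can arise from rounding of the published coefficients:

# A minimal sketch of plugging the year-26 and year-27 assumptions into the fitted equation.
def forecast_margin(net_rev_per_dollar, num_offices):
    return 1.56450 + 0.23720 * net_rev_per_dollar - 0.000249 * num_offices

print(round(forecast_margin(4.70, 9350), 4))   # year 26
print(round(forecast_margin(4.80, 9400), 4))   # year 27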

Next Week’s Forecast Friday Topic: The Effect of Omitting an Important Variable

Now that we’ve walked you through this process, you know how to forecast and run multiple regression. Next week, we will discuss what happens when a key independent variable is omitted from a regression model and all the problems it causes when we violate the regression assumption that “all relevant and no irrelevant independent variables are included in the model.” Next week’s post will show a complete demonstration of such an impact. Stay tuned!

Forecast Friday Topic: Multiple Regression Analysis

June 17, 2010

(Ninth in a series)

Quite often, when we try to forecast sales, more than one variable is involved. Sales depend on how much advertising we do, the price of our products, the price of competitors’ products, the time of the year (if our product is seasonal), and the demographics of the buyers. And there can be many more factors. Hence, we need to measure the impact of all relevant variables that we know drive our sales or other dependent variable. That brings us to the need for multiple regression analysis. Because of its complexity, we will spend the next several weeks discussing multiple regression analysis in easily digestible parts. Multiple regression is a highly useful technique, but it is quite easy to forget if not used often.

Another thing to note: regression analysis can be used for both time series and cross-sectional analysis. Time series is what we have focused on all along. Cross-sectional analysis involves using regression to analyze variables in static data (such as predicting how much money a person will spend on a car based on income, race, age, etc.). We will use examples of both in our discussions of multiple regression.

Determining Parameter Estimates for Multiple Regression

When it comes to deriving the parameter estimates in a multiple regression, the process gets both complicated and tedious, even if you have just two independent variables. We strongly advise you to use the regression features of MS-Excel, or a statistical analysis tool like SAS, SPSS, or MINITAB. In fact, we will not work out the derivation of the parameters with the data sets, but will provide you the results. You are free to run the data we provide on your own to replicate the results we display. I do, however, want to show you the equations for computing the parameter estimates of a three-variable model (two independent variables and one dependent variable), and point out something very important.

Let’s assume that sales is your dependent variable, Y, and advertising expenditures and price are your independent variables, X1 and X2, respectively. Also, the coefficients – your parameter estimates – will have subscripts that correspond to their respective independent variables. Hence, your model will take on the form:

Yi = α + β1X1i + β2X2i + εi

Now, how do you go about computing α, β1 and β2? The process is similar to that of a two-variable model, but a little more involved. Working with each variable’s deviations from its mean (for example, yi = Yi – Ȳ, x1i = X1i – X̄1, and x2i = X2i – X̄2), the estimates are:

β1 = [Σ(x1iyi) × Σ(x2i²) – Σ(x2iyi) × Σ(x1ix2i)] / [Σ(x1i²) × Σ(x2i²) – (Σ(x1ix2i))²]

β2 = [Σ(x2iyi) × Σ(x1i²) – Σ(x1iyi) × Σ(x1ix2i)] / [Σ(x1i²) × Σ(x2i²) – (Σ(x1ix2i))²]

α = Ȳ – β1X̄1 – β2X̄2

The subscript “i” represents the individual observation. In time series, the subscript can also be represented with a “t”.

What do you notice about the formulas for computing β1 and β2? First, you notice that both independent variables, X1 and X2, appear in the calculation of each coefficient. Why is this? Because when two or more independent variables are used to estimate the dependent variable, the independent variables themselves are usually related to one another as well, and each partial slope estimate must account for that overlap. If either β1 or β2 turned out to be zero, then simple regression would be appropriate. However, if we omit one or more independent variables from the model that are related to the variables in the model, we run into serious problems, namely:

Specification Bias (Regression Assumptions Revisited)

Recall from last week’s Forecast Friday discussion on regression assumptions that 1) our equation must correctly specify the true regression model, namely that all relevant variables and no irrelevant variables are included in the model and 2) the independent variables must not be correlated with the error term. If either of these assumptions is violated, the parameter estimates you get will be biased. Looking at the above equations for β1 and β2, we can see that if we excluded one of the independent variables, say X2, from the model, the value derived for β1 will be incorrect because X1 has some relationship with X2. Moreover, X2‘s values are likely to be accounted for in the error terms, and because of its relationship with X1, X1 will be correlated with the error term, violating the second assumption above. Hence, you will end up with incorrect, biased estimators for your regression coefficient, β1.

Omitted Variables are Bad, but Excessive Variables Aren’t Much Better

Since omitting relevant variables can lead to biased parameter estimates, many analysts have a tendency to include any variable that might have any chance of affecting the dependent variable, Y. This is also bad. Additional variables mean that you need to estimate more parameters, which reduces your model’s degrees of freedom and the efficiency (trustworthiness) of your parameter estimates. Generally, for each variable – both dependent and independent – you are considering, you should have at least five data points. So, for a model with three independent variables, your data set should have at least 20 observations.

Another Important Regression Assumption

One last thing about multiple regression analysis – another assumption, which I deliberately left out of last week’s discussion, since it applies exclusively to multiple regression:

No combination of independent variables should have an exact linear relationship with one another.

OK, so what does this mean? Let’s assume you’re doing a model to forecast the effect of temperature on the speed at which ice melts. You use two independent variables: Celsius temperature and Fahrenheit temperature. What’s the problem here? There is a perfect linear relationship between these two variables. Every time you use a particular value of Fahrenheit temperature, you will get the same value of Celsius temperature. In this case, you will end up with multicollinearity, an assumption violation that results in inefficient parameter estimates. A relationship between independent variables need not be perfectly linear for multicollinearity to exist. Highly correlated variables can do the same thing. For example, independent variables such as “Husband Age” and “Wife Age,” or “Home Value” and “Home Square Footage” are examples of independent variables that are highly correlated.

You want to be sure that you do not put variables in the model that need not be there, because doing so could lead to multicollinearity.
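Here is a minimal sketch in Python of why the Celsius/Fahrenheit pairing breaks down: the two series are perfectly correlated, so a design matrix that includes both (plus an intercept) is rank-deficient and the individual effects cannot be separated:

# A minimal sketch of the Celsius/Fahrenheit example of perfect multicollinearity.
import numpy as np

celsius = np.array([-10.0, 0.0, 5.0, 15.0, 25.0, 35.0])
fahrenheit = celsius * 9 / 5 + 32

print(np.corrcoef(celsius, fahrenheit)[0, 1])      # effectively 1.0

# Design matrix with an intercept and both temperature measures
X = np.column_stack([np.ones_like(celsius), celsius, fahrenheit])
print(np.linalg.matrix_rank(X))                    # 2, not 3 -- the columns are linearly dependent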

Now Can We Get Into Multiple Regression????

Wasn’t that an ordeal? Well, now the fun can begin! I’m going to use an example from one of my old graduate school textbooks, because it’s good for several lessons in multiple regression. The data set consists of 25 annual observations used to predict the percentage profit margin (Y) for U.S. savings and loan associations, based on net revenues per deposit dollar (X1) and number of offices (X2). The data are as follows:

Year | Percentage Profit Margin (Yt) | Net Revenues Per Deposit Dollar (X1t) | Number of Offices (X2t)
1 | 0.75 | 3.92 | 7,298
2 | 0.71 | 3.61 | 6,855
3 | 0.66 | 3.32 | 6,636
4 | 0.61 | 3.07 | 6,506
5 | 0.70 | 3.06 | 6,450
6 | 0.72 | 3.11 | 6,402
7 | 0.77 | 3.21 | 6,368
8 | 0.74 | 3.26 | 6,340
9 | 0.90 | 3.42 | 6,349
10 | 0.82 | 3.42 | 6,352
11 | 0.75 | 3.45 | 6,361
12 | 0.77 | 3.58 | 6,369
13 | 0.78 | 3.66 | 6,546
14 | 0.84 | 3.78 | 6,672
15 | 0.79 | 3.82 | 6,890
16 | 0.70 | 3.97 | 7,115
17 | 0.68 | 4.07 | 7,327
18 | 0.72 | 4.25 | 7,546
19 | 0.55 | 4.41 | 7,931
20 | 0.63 | 4.49 | 8,097
21 | 0.56 | 4.70 | 8,468
22 | 0.41 | 4.58 | 8,717
23 | 0.51 | 4.69 | 8,991
24 | 0.47 | 4.71 | 9,179
25 | 0.32 | 4.78 | 9,318

Data taken from Spellman, L.J., “Entry and profitability in a rate-free savings and loan market,” Quarterly Review of Economics and Business 18, no. 2 (1978): 87-95. Reprinted in Newbold, P. and Bos, T., Introductory Business & Economic Forecasting, 2nd Edition, Cincinnati (1994): 136-137.

What is the relationship between the S&Ls’ profit margin percentage and the number of S&L offices? How about between the margin percentage and the net revenues per deposit dollar? Is the relationship positive (that is, profit margin percentage moves in the same direction as its independent variable(s))? Or negative (the dependent and independent variables move in opposite directions)? Let’s look at each independent variable’s individual relationship with the dependent variable.

Net Revenue Per Deposit Dollar (X1) and Percentage Profit Margin (Y)

Generally, if revenue per deposit dollar goes up, would we not expect the percentage profit margin to also go up? After all, if the S & L is making more revenue on the same dollar, it suggests more efficiency. Hence, we expect a positive relationship. So, in the resulting regression equation, we would expect the coefficient, β1, for net revenue per deposit dollar to have a “+” sign.

Number of S&L Offices (X2) and Percentage Profit Margin (Y)

Generally, if there are more S&L offices, would that not suggest either higher overhead, increased competition, or some combination of the two? Those would cut into profit margins. Hence, we expect a negative relationship. So, in the resulting regression equation, we would expect the coefficient, β2, for number of S&L offices to have a “-” sign.

Are our Expectations Correct?

Do our relationship expectations hold up?  They certainly do. The estimated multiple regression model is:

Yt = 1.56450 + 0.23720X1t – 0.000249X2t

What do the Parameter Estimates Mean?

Essentially, the model says that if net revenues per deposit dollar (X1t) increase by one unit, then percentage profit margin (Yt) will – on average – increase by 0.23720 percentage points, when the number of S&L offices is fixed. If the number of offices (X2t) increases by one, then percentage profit margin (Yt) will decrease by an average of 0.000249 percentage points, when net revenues are fixed.

Do Changes in the Independent Variables Explain Changes in The Dependent Variable?

We compute the coefficient of determination, R2, and get 0.865, indicating that changes in the number of S&L offices and in the net revenue per deposit dollar explain 86.5% of the variation in S&L percentage profit margin.

Are the Parameter Estimates Statistically Significant?

We have 25 observations, and three parameters – two coefficients for the independent variables and one intercept – hence we have 22 degrees of freedom (25-3). If we choose a 95% confidence level, we are saying that if we resampled and replicated this analysis 100 times, the confidence intervals we construct would contain the true parameter approximately 95 times. To do this, we need to look at the t-values for each parameter estimate. For a two-tailed 95% significance test with 22 degrees of freedom, our critical t-value is 2.074. That means that if the t-statistic for a parameter estimate is greater than 2.074, then there is a strong positive relationship between the independent variable and the dependent variable; if the t-statistic for the parameter estimate is less than -2.074, then there is a strong negative relationship. This is what we get:

Parameter | Value | T-Statistic | Significant?
Intercept | 1.5645000 | 19.70 | Yes
β1 | 0.2372000 | 4.27 | Yes
β2 | (0.0002490) | (7.77) | Yes

So, yes, all our parameter estimates are significant.
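If you want to replicate this model outside Excel, here is a minimal sketch in Python (statsmodels) using the 25 observations above; the coefficients, t-statistics, and R2 should come out close to the values reported in this post:

# A minimal sketch that refits the S&L model with statsmodels on the table above.
import pandas as pd
import statsmodels.api as sm

data = [
    (0.75, 3.92, 7298), (0.71, 3.61, 6855), (0.66, 3.32, 6636), (0.61, 3.07, 6506),
    (0.70, 3.06, 6450), (0.72, 3.11, 6402), (0.77, 3.21, 6368), (0.74, 3.26, 6340),
    (0.90, 3.42, 6349), (0.82, 3.42, 6352), (0.75, 3.45, 6361), (0.77, 3.58, 6369),
    (0.78, 3.66, 6546), (0.84, 3.78, 6672), (0.79, 3.82, 6890), (0.70, 3.97, 7115),
    (0.68, 4.07, 7327), (0.72, 4.25, 7546), (0.55, 4.41, 7931), (0.63, 4.49, 8097),
    (0.56, 4.70, 8468), (0.41, 4.58, 8717), (0.51, 4.69, 8991), (0.47, 4.71, 9179),
    (0.32, 4.78, 9318),
]
df = pd.DataFrame(data, columns=["profit_margin", "net_rev_per_dollar", "offices"])

X = sm.add_constant(df[["net_rev_per_dollar", "offices"]])
model = sm.OLS(df["profit_margin"], X).fit()

print(model.params)      # expect roughly 1.5645, 0.2372, -0.000249
print(model.tvalues)     # expect roughly 19.7, 4.27, -7.77
print(model.rsquared)    # expect roughly 0.865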

Next Forecast Friday: Building on What You Learned

I think you’ve had enough for this week! But we are still not finished. We’re going to stop here and continue with further analysis of this example next week. Next week, we will discuss computing the 95% confidence interval for the parameter estimates; determining whether the model is valid; and checking for autocorrelation. The following Forecast Friday (July 1) blog post will discuss specification bias in greater detail, demonstrating the impact of omitting a key independent variable from the model.