## Posts Tagged ‘degrees of freedom’

### Forecast Friday Topic: Seasonal Dummy Variables

September 30, 2010

(Twenty-third in a series)

Last week, I introduced you to the use of dummy variables as a means of incorporating qualitative information into a regression model. Dummy variables can also be used to account for seasonality. A couple of weeks ago, we discussed adjusting your data for seasonality before constructing your model. As you saw, that can be quite time consuming. A faster approach is to take the raw time series data and add a dummy variable for each season of the year less one. So, if you’re working with quarterly data, you would use three dummy variables; if you have monthly data, you would add 11 dummy variables.

For example, the fourth quarter of the year is often the busiest for most retailers. If a retail chain didn’t seasonally adjust its data, it might choose to create three dummy variables: D1, D2, and D3. The first quarter of the year would be D1; the second quarter, D2; and the third quarter, D3. As we discussed last week, we always want to have one fewer dummy variable than we have outcomes. In our example, since we know the fourth quarter is the busiest quarter, we would expect our three dummy variables to be significant and negative.

Revisiting Billie Burton

A couple of weeks ago, while discussing how to decompose a time series, I used the example of Billie Burton, a businesswoman who makes gift baskets. Billie had been trying to forecast orders for planning and budgeting purposes. She had five years of monthly order data:

Total Gift Basket Orders

| Month | 2005 | 2006 | 2007 | 2008 | 2009 |
|-----------|------|------|------|------|------|
| January | 15 | 18 | 22 | 26 | 31 |
| February | 30 | 36 | 43 | 52 | 62 |
| March | 25 | 18 | 22 | 43 | 32 |
| April | 15 | 30 | 36 | 27 | 52 |
| May | 13 | 16 | 19 | 23 | 28 |
| June | 14 | 17 | 20 | 24 | 29 |
| July | 12 | 14 | 17 | 20 | 24 |
| August | 22 | 26 | 31 | 37 | 44 |
| September | 20 | 24 | 29 | 35 | 42 |
| October | 14 | 17 | 20 | 24 | 29 |
| November | 35 | 42 | 50 | 60 | 72 |
| December | 40 | 48 | 58 | 70 | 84 |

You recall the painstaking effort we went through to adjust Billie’s orders for seasonality. Is there a simpler way? Yes. We can use dummy variables. Let’s first assume Billie ran her regression on the data just as it is, with no adjustment for seasonality. She ends up with the following regression equation:

Ŷ = 0.518t + 15.829

This model suggests an upward trend with each passing month but doesn’t fit the data quite as well as we would like: R² is just 0.313 and the F-statistic is just 26.47.

Imagine now that Billie decides to use seasonal dummy variables. Since her data is monthly, Billie must use 11 dummy variables. Because December is her busiest month, Billie makes one dummy variable for each month from January to November: D1 is January; D2 is February; and so on, up to D11, which is November. Hence, in January, D1 will be flagged as 1 and D2 through D11 will be 0. In February, D2 will equal 1 while all the other dummies will be zero. And so forth. Note that all dummies will be zero in December.

Picture in your mind a table with 60 rows and 13 columns. Each row contains the monthly data from January 2005 to December 2009. The first column is the number of orders for the month; the second is the time period, t, which is 1 to 60. That is our independent variable from our original model. The next eleven columns are the dummy variables. Billie enters these into Excel and runs her regression. What does she get?
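If you’d rather script this than build the table by hand in Excel, here is a minimal sketch in Python, using pandas and statsmodels (tools that are not part of Billie’s example), that constructs the trend variable and the 11 monthly dummies and fits the same regression:

```python
# A sketch of Billie's regression in Python: build the trend variable and the
# 11 monthly dummies (December is the omitted, "base" month), then fit OLS.
import pandas as pd
import statsmodels.api as sm

# Monthly gift basket orders, January through December, 2005-2009 (table above)
orders_by_year = {
    2005: [15, 30, 25, 15, 13, 14, 12, 22, 20, 14, 35, 40],
    2006: [18, 36, 18, 30, 16, 17, 14, 26, 24, 17, 42, 48],
    2007: [22, 43, 22, 36, 19, 20, 17, 31, 29, 20, 50, 58],
    2008: [26, 52, 43, 27, 23, 24, 20, 37, 35, 24, 60, 70],
    2009: [31, 62, 32, 52, 28, 29, 24, 44, 42, 29, 72, 84],
}

df = pd.DataFrame({
    "orders": [v for year in sorted(orders_by_year) for v in orders_by_year[year]],
})
df["t"] = range(1, len(df) + 1)          # time period, 1 to 60
df["month"] = list(range(1, 13)) * 5     # 1 = January, ..., 12 = December

# One dummy per month from January (D1) through November (D11)
dummy_cols = []
for m in range(1, 12):
    col = f"D{m}"
    df[col] = (df["month"] == m).astype(int)
    dummy_cols.append(col)

X = sm.add_constant(df[["t"] + dummy_cols])
model = sm.OLS(df["orders"], X).fit()
print(model.params)                      # intercept, trend, and 11 seasonal coefficients
print(model.rsquared, model.fvalue)      # fit statistics
```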

I’m going to show you the resulting equation in tabular form, as it would look far too complicated in standard form. Billie gets the following output:

| Parameter | Coefficient | t Stat |
|---------------|-------------|--------|
| Intercept | 42.93 | 15.93 |
| t | 0.47 | 12.13 |
| D1 (January) | -32.38 | -9.88 |
| D2 (February) | -10.66 | -3.26 |
| D3 (March) | -27.73 | -8.48 |
| D4 (April) | -24.21 | -7.41 |
| D5 (May) | -36.88 | -11.31 |
| D6 (June) | -36.35 | -11.16 |
| D7 (July) | -40.23 | -12.35 |
| D8 (August) | -26.10 | -8.02 |
| D9 (September) | -28.58 | -8.79 |
| D10 (October) | -38.25 | -11.76 |
| D11 (November) | -7.73 | -2.38 |

Billie gets a great model: all the parameter estimates are significant, and the 11 seasonal dummies are all negative, consistent with December being the busiest month. Billie’s R² has now shot up to 0.919, indicating a much better fit. And the F-statistic is up to 44.73, more significant than in the unadjusted model.

How does this compare to Billie’s model on her seasonally-adjusted data? Recall that when doing her regressions on seasonally adjusted data, Billie got the following results:

Ŷ = 0.47t + 17.12

Her model had an R² of 0.872, but her F-statistic was almost 395! So, even though Billie gained a few more points of R² with the seasonal dummies, her F-statistic wasn’t quite as significant. However, Billie’s F-statistic using the dummy variables is still very strong, and I would argue more stable. Recall that the F-statistic is determined by dividing the mean squared error of the regression by the mean squared error of the residuals. The mean squared error of the regression is the sum of squares regression (SSR) divided by the number of independent variables in the model; the mean squared error of the residuals is the sum of squared error (SSE) divided by the number of observations less the number of independent variables, less one more. To illustrate, here is a side-by-side comparison:

| | Seasonally Adjusted Model | Seasonal Dummy Model |
|---------------------------------|---------------------------|----------------------|
| # Observations | 60 | 60 |
| SSR | 3,982 | 14,179 |
| # Independent Variables | 1 | 12 |
| Mean Square Error of Regression | 3,982 | 1,182 |
| SSE | 585 | 1,241 |
| Degrees of Freedom | 58 | 47 |
| Mean Squared Error of Residuals | 10.08 | 26.41 |
| F-Statistic | 394.91 | 44.73 |
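To see where the two F-statistics come from, plug the values from the table into the formula described above, where k is the number of independent variables and n is the number of observations:

F = (SSR / k) / (SSE / (n - k - 1))

Seasonally adjusted model: F = (3,982 / 1) / (585 / 58) = 3,982 / 10.08 ≈ 394.9

Seasonal dummy model: F = (14,179 / 12) / (1,241 / 47) = 1,182 / 26.41 ≈ 44.7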

So, although the F-statistic is much lower for the seasonal dummy model, the mean square error of the regression is also much lower. As a result, the F-statistic is still quite significant, and, I would argue, more stable than that of our one-variable model built on the seasonally adjusted data.

It is important to note that sometimes data sets do not lend themselves well to seasonal dummies, and that the manual adjustment process we worked through a few weeks ago may be a better approach.

Next Forecast Friday Topic: Slope Dummy Variables

The dummy variables we worked with last week and this week are intercept dummies: they alter the Y-intercept of the regression equation. Sometimes, however, it is necessary to change the slope of the equation. We will discuss how slope dummies are used in next week’s Forecast Friday post.


### Analyzing Subgroups of Data

July 21, 2010

The data available to us has never been more voluminous. Thanks to technology, data about us and our environment are collected almost continuously. When we use a cell phone to call someone else’s cell phone, several pieces of information are collected: the two phone numbers involved in the call; the time the call started and ended; the cell phone towers closest to the two parties; the cell phone carriers; the distance of the call; the date; and many more. Cell phone companies use this information to determine where to increase capacity; refine, price, and promote their plans more effectively; and identify regions with inadequate coverage.

Multiply these different pieces of data by the number of calls in a year, a month, a day – even an hour – and you can easily see that we are dealing with enormous amounts of records and observations. While it’s good for decision makers to see what sales, school enrollment, cell phone usage, or any other pattern looks like in total, quite often they are even more interested in breaking down data into groups to see if certain groups behave differently. Quite often we hear decision makers asking questions like these:

• How do depositors under age 35 compare with those aged 35-54 and those 55 and over in their choice of banking products?
• How will voter support for Candidate A differ by race or ethnicity?
• How does cell phone usage differ between men and women?
• Does the length or severity of a prison sentence differ by race?

When we break data down into subgroups, we are trying to see whether knowing about these groups adds any additional meaningful information. This helps us customize marketing messages, product packages, pricing structures, and sales channels for different segments of our customers. There are many different ways we can break data down: by region, age, race, gender, income, spending levels; the list is limitless.

To give you an example of how data can be analyzed by groups, let’s revisit Jenny Kaplan, owner of K-Jen, the New Orleans-style restaurant. If you recall from the May 25 post, Jenny tested two coupon offers for her $10 jambalaya entrée: one offering 10% off and another offering $1 off. Even though the savings were the same, Jenny thought customers would respond differently. As Jenny found, neither offer was better than the other at increasing the average size of the table check. Now, Jenny wants to see if there is a preference for one offer over the other based on customer age.

Jenny knows that of her 1,000-patron database, about 50% are between the ages of 18 and 35; the rest are older than 35. So Jenny decides to send out 1,000 coupons via email as follows:

| Coupons Sent | $1 off | 10% off | Total |
|--------------|--------|---------|-------|
| 18-35 | 250 | 250 | 500 |
| Over 35 | 250 | 250 | 500 |
| Total | 500 | 500 | 1,000 |

Half of Jenny’s customers received one coupon offer and half received the other; as the table above shows, within each age group, half the people got one offer and half got the other. At the end of the promotion period, Jenny received back 200 coupons. She tracks the coupon codes back to her database and finds the following pattern:

| Coupons Redeemed (Actual) | $1 off | 10% off | Total |
|---------------------------|--------|---------|-------|
| 18-35 | 35 | 65 | 100 |
| Over 35 | 55 | 45 | 100 |
| Total | 90 | 110 | 200 |

Exactly 200 coupons were redeemed, 100 from each age group. But notice something else: of the 200 people redeeming the coupon, 110 redeemed the coupon offering 10% off; just 90 redeemed the $1 off coupon. Does this mean the 10% off coupon was the better offer? Not so fast!

What Else is the Table Telling Us?

Look at each age group. Of the 100 customers aged 18-35, 65 redeemed the 10% off coupon; but of the 100 customers over 35, just 45 did. Is that a meaningful difference or just a fluke? Do persons over 35 prefer an offer of $1 off to one of 10% off? There’s one way to tell: a chi-squared test for statistical significance.

The Chi-Squared Test

Generally, a chi-squared test is useful for determining whether there is an association between two categorical variables – here, age group and the coupon offer redeemed. The chi-squared (χ²) statistic is the value needed to determine statistical significance. In order to compute χ², Jenny needs to know two things: the actual frequency distribution of the coupons redeemed (shown in the last table above) and the expected frequencies.

Expected frequencies are the frequencies you would expect to see in each cell if there were no association between the groups – that is, based on probability alone. In this case, we have two equal-sized groups: customers aged 18-35 and customers over 35. Knowing nothing besides the fact that the same number of people in each group redeemed coupons, and that 110 of them redeemed the 10% off coupon while 90 redeemed the $1 off coupon, we would expect 55 customers in each group to redeem the 10% off coupon and 45 in each group to redeem the $1 off coupon. Hence, in our expected frequencies, we still expect 55% of the total customers to redeem the 10% off offer. Jenny’s expected frequencies are:

| Coupons Redeemed (Expected) | $1 off | 10% off | Total |
|-----------------------------|--------|---------|-------|
| 18-35 | 45 | 55 | 100 |
| Over 35 | 45 | 55 | 100 |
| Total | 90 | 110 | 200 |

As you can see, the totals for each row and column match those in the actual frequency table above. The mathematical way to compute the expected frequencies for each cell would be to multiply its corresponding column total by its corresponding row total and then divide it by the total number of observations. So, we would compute as follows:

| Frequency of: | Formula | Result |
|----------------------------|-------------------|--------|
| 18-35 redeeming $1 off | (100 × 90) / 200 | 45 |
| 18-35 redeeming 10% off | (100 × 110) / 200 | 55 |
| Over 35 redeeming $1 off | (100 × 90) / 200 | 45 |
| Over 35 redeeming 10% off | (100 × 110) / 200 | 55 |
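The same rule can be applied to all four cells at once; here is a minimal sketch in Python with numpy (a tool Jenny isn’t assumed to be using; her by-hand version is in the table above):

```python
# A sketch of the expected-frequency rule with numpy: row total times column
# total, divided by the grand total, for every cell at once.
import numpy as np

observed = np.array([[35, 65],    # 18-35:   $1 off, 10% off
                     [55, 45]])   # Over 35: $1 off, 10% off

row_totals = observed.sum(axis=1)     # [100, 100]
col_totals = observed.sum(axis=0)     # [ 90, 110]
expected = np.outer(row_totals, col_totals) / observed.sum()
print(expected)                       # [[45. 55.], [45. 55.]]
```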

Now that Jenny knows the expected frequencies, she must find the critical χ² statistic for her desired level of significance, and then compute the χ² statistic from her data. If the latter is greater than the critical χ² statistic, Jenny knows that the customer’s age group is associated with the coupon offer redeemed.

Determining the Critical χ² Statistic

To find out what her critical χ² statistic is, Jenny must first determine the degrees of freedom in her data. For cross-tabulation tables, the number of degrees of freedom is a straightforward calculation:

Degrees of freedom = (# of rows – 1) × (# of columns – 1)

With two rows of data and two columns, Jenny has (2 – 1) × (2 – 1) = 1 degree of freedom. With this information, Jenny grabs her old college statistics book and looks at the χ² distribution table in the appendix. At the 95% confidence level with one degree of freedom, her critical χ² statistic is 3.84. When Jenny calculates the χ² statistic from her frequencies, she will compare it with the critical χ² statistic. If Jenny’s χ² statistic is greater than the critical value, she will conclude that the difference is statistically significant and that age does relate to which coupon offer is redeemed.

Calculating the χ² Value From Observed Frequencies

Now, Jenny needs to compare the actual number of coupons redeemed in each cell to the expected number. To compute her χ² value, Jenny takes each cell, subtracts the expected frequency from the actual frequency, squares the difference, and divides it by the expected frequency; she then sums the results across all cells:

χ² = Σ (Observed – Expected)² / Expected

| | $1 off | 10% off |
|---------|------------------------|------------------------|
| 18-35 | (35 - 45)² / 45 = 2.22 | (65 - 55)² / 55 = 1.82 |
| Over 35 | (55 - 45)² / 45 = 2.22 | (45 - 55)² / 55 = 1.82 |

χ² = 2.22 + 1.82 + 2.22 + 1.82 = 8.08

Jenny’s χ² value is 8.08, much higher than the critical value of 3.84, indicating that there is indeed an association between age and coupon redemption.
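If you’d rather not do the arithmetic by hand, here is a minimal sketch using Python’s scipy (not something Jenny needs for this example, but handy for checking the work):

```python
# A sketch that checks the hand calculation with scipy. correction=False turns
# off the Yates continuity correction so the statistic matches the formula above.
from scipy.stats import chi2, chi2_contingency

observed = [[35, 65],    # 18-35:   $1 off, 10% off
            [55, 45]]    # Over 35: $1 off, 10% off

stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(round(stat, 2))                  # 8.08
print(expected)                        # [[45. 55.], [45. 55.]]
print(round(chi2.ppf(0.95, dof), 2))   # 3.84, the critical value at 95% confidence
print(p_value)                         # well below 0.05
```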

Interpreting the Results

Jenny concludes that patrons over the age of 35 are more inclined than patrons aged 18-35 to take advantage of a $1 off coupon, while patrons aged 18-35 are more inclined to prefer the 10% off coupon. How Jenny uses this information depends on the objectives of her business. If Jenny feels that K-Jen needs to attract more middle-aged and senior citizens, she should use the $1 off coupon when targeting them. If Jenny feels K-Jen isn’t selling enough jambalaya, she might try to stimulate demand by couponing, sending the $1 off coupon to patrons over the age of 35 and the 10% off coupon to those aged 18-35.

Jenny might even have a counterintuitive use for the information. If most of K-Jen’s regular patrons are over age 35, they may already be loyal customers. Jenny might still send them coupons, but give them the 10% off coupon instead. Why? These customers are likely to buy the jambalaya anyway, so why not give them the coupon they are less likely to redeem? After all, why give someone a discount if they’re going to buy anyway! Giving the 10% off coupon to these customers does two things: first, it shows them that K-Jen still cares about their business and keeps K-Jen on their radar as a dining option. Second, by using the coupon with the lower redemption rate, Jenny reduces her exposure to subsidizing loyal customers. In this instance, Jenny uses the coupons to advertise and promote awareness, rather than to move orders of jambalaya.

There are several more ways to analyze data by subgroup, some of which will be discussed in future posts. It is important to remember that your research objectives dictate the information you collect, which in turn dictates the appropriate analysis to conduct.


### Forecast Friday Topic: Multiple Regression Analysis

June 17, 2010

(Ninth in a series)

Quite often, when we try to forecast sales, more than one variable is involved. Sales depend on how much advertising we do, the price of our products, the prices of competitors’ products, the time of year (if our product is seasonal), and the demographics of the buyers. And there can be many more factors. Hence, we need to measure the impact of all relevant variables that we know drive our sales or other dependent variable. That brings us to the need for multiple regression analysis. Because of its complexity, we will be spending the next several weeks discussing multiple regression analysis in easily digestible parts. Multiple regression is a highly useful technique, but it is quite easy to forget if not used often.

Another thing to note: regression analysis is used for both time series and cross-sectional analysis. Time series is what we have focused on all along. Cross-sectional analysis involves using regression to analyze variables in static data (such as predicting how much money a person will spend on a car based on income, race, age, etc.). We will use examples of both in our discussions of multiple regression.

Determining Parameter Estimates for Multiple Regression

When it comes to deriving the parameter estimates in a multiple regression, the process gets both complicated and tedious, even if you have just two independent variables. We strongly advise you to use the regression features of MS-Excel, or a statistical analysis tool like SAS, SPSS, or MINITAB. In fact, we will not work out the derivation of the parameters by hand for the data sets, but will provide you the results. You are free to run the data we provide on your own to replicate the results we display. I do, however, want to show you the equations for computing the parameter estimates for a three-variable model (two independent variables and one dependent variable), and point out something very important.

Let’s assume that sales is your dependent variable, Y, and advertising expenditures and price are your independent variables, X1 and X2, respectively. Also, the coefficients – your parameter estimates – will have subscripts corresponding to their respective independent variables. Hence, your model will take on the form:

Y = α + β1X1 + β2X2 + ε

Now, how do you go about computing α, β1 and β2? The process is similar to that of a two-variable model, but a little more involved. Take a look:
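Written in deviations-from-the-means form, where the lowercase letters denote each observation’s deviation from its variable’s mean (for example, x1i = X1i - X̄1 and yi = Yi - Ȳ), the least-squares estimates are:

β1 = [ (Σx2i²)(Σx1iyi) - (Σx1ix2i)(Σx2iyi) ] / [ (Σx1i²)(Σx2i²) - (Σx1ix2i)² ]

β2 = [ (Σx1i²)(Σx2iyi) - (Σx1ix2i)(Σx1iyi) ] / [ (Σx1i²)(Σx2i²) - (Σx1ix2i)² ]

α = Ȳ - β1X̄1 - β2X̄2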

The subscript “i” represents the individual observation. In time series, the subscript can also be represented with a “t”.

What do you notice about the formulas for computing β1 and β2? First, you notice that both independent variables, X1 and X2, appear in the calculation of each coefficient. Why is this? Because when two or more independent variables are used to estimate the dependent variable, the independent variables themselves are usually related to one another to some degree, so each coefficient must be estimated while accounting for the other variable’s influence. If either β1 or β2 turned out to be zero, then simple regression would be appropriate. However, if we omit one or more independent variables from the model that are related to those variables in the model, we run into serious problems, namely:

Specification Bias (Regression Assumptions Revisited)

Recall from last week’s Forecast Friday discussion on regression assumptions that 1) our equation must correctly specify the true regression model, namely that all relevant variables and no irrelevant variables are included in the model, and 2) the independent variables must not be correlated with the error term. If either of these assumptions is violated, the parameter estimates you get will be biased. Looking at the above equations for β1 and β2, we can see that if we excluded one of the independent variables, say X2, from the model, the value derived for β1 would be incorrect, because X1 has some relationship with X2. Moreover, X2’s influence would be picked up by the error term, and because of its relationship with X1, X1 would be correlated with the error term, violating the second assumption above. Hence, you will end up with an incorrect, biased estimate of your regression coefficient, β1.
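To make specification bias concrete, here is a small simulated illustration in Python (the numbers are invented and are not part of the original discussion): the true coefficient on X1 is 2, but leaving the correlated variable X2 out of the model biases the estimate of β1 upward.

```python
# An illustrative simulation (made-up data): the true coefficient on X1 is 2,
# but omitting the correlated variable X2 biases the estimate of beta1 toward
# 2 + 3*0.8 = 4.4, because X1 picks up X2's effect.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)   # X2 is correlated with X1
y = 5 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

X_full = np.column_stack([np.ones(n), x1, x2])  # correctly specified model
X_omit = np.column_stack([np.ones(n), x1])      # X2 omitted

beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]
beta_omit = np.linalg.lstsq(X_omit, y, rcond=None)[0]

print(beta_full[1])   # close to the true value, 2.0
print(beta_omit[1])   # roughly 4.4 -- biased upward
```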

Omitted Variables are Bad, but Excessive Variables Aren’t Much Better

Since omitting relevant variables can lead to biased parameter estimates, many analysts have a tendency to include any variable that might have any chance of affecting the dependent variable, Y. This is also bad. Additional variables mean that you need to estimate more parameters, and that reduces your model’s degrees of freedom and the efficiency (trustworthiness) of your parameter estimates. Generally, for each variable you are considering – both dependent and independent – you should have at least five data points. So, for a model with three independent variables (plus the dependent variable), your data set should have at least 20 observations.

Another Important Regression Assumption

One last thing about multiple regression analysis – another assumption, which I deliberately left out of last week’s discussion, since it applies exclusively to multiple regression:

No combination of independent variables should have an exact linear relationship with one another.

OK, so what does this mean? Let’s assume you’re building a model to forecast the effect of temperature on the speed at which ice melts, and you use two independent variables: Celsius temperature and Fahrenheit temperature. What’s the problem here? There is a perfect linear relationship between these two variables: every value of Fahrenheit temperature corresponds to exactly one value of Celsius temperature. In this case, you will end up with multicollinearity; with an exact linear relationship, the parameters cannot be estimated at all. The relationship between independent variables need not be perfectly linear for multicollinearity to be a problem, either – highly correlated variables produce inefficient (unreliable) parameter estimates as well. For example, independent variables such as “Husband Age” and “Wife Age,” or “Home Value” and “Home Square Footage,” are likely to be highly correlated.

You want to be sure that you do not put variables in the model that need not be there, because doing so could lead to multicollinearity.
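One simple precaution, sketched below with made-up data, is to check the pairwise correlations among your candidate predictors before fitting; a correlation at or very near 1 (or -1) is a red flag:

```python
# A quick sketch (made-up data) of checking candidate predictors for near-perfect
# correlation before fitting. The Fahrenheit column is an exact function of Celsius.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
celsius = rng.uniform(-10, 35, size=200)

predictors = pd.DataFrame({
    "celsius": celsius,
    "fahrenheit": celsius * 9 / 5 + 32,          # perfectly collinear with celsius
    "humidity": rng.uniform(20, 90, size=200),   # an unrelated predictor
})

print(predictors.corr())   # celsius vs. fahrenheit correlation = 1.0 -> drop one
```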

Now Can We Get Into Multiple Regression????

Wasn’t that an ordeal? Well, now the fun can begin! I’m going to use an example from one of my old graduate school textbooks, because it’s good for several lessons in multiple regression. This data set contains 25 annual observations used to predict the percentage profit margin (Y) for U.S. savings and loan associations, based on changes in net revenues per deposit dollar (X1) and number of offices (X2). The data are as follows:

| Year | Percentage Profit Margin (Yt) | Net Revenues Per Deposit Dollar (X1t) | Number of Offices (X2t) |
|------|-------------------------------|---------------------------------------|-------------------------|
| 1 | 0.75 | 3.92 | 7,298 |
| 2 | 0.71 | 3.61 | 6,855 |
| 3 | 0.66 | 3.32 | 6,636 |
| 4 | 0.61 | 3.07 | 6,506 |
| 5 | 0.70 | 3.06 | 6,450 |
| 6 | 0.72 | 3.11 | 6,402 |
| 7 | 0.77 | 3.21 | 6,368 |
| 8 | 0.74 | 3.26 | 6,340 |
| 9 | 0.90 | 3.42 | 6,349 |
| 10 | 0.82 | 3.42 | 6,352 |
| 11 | 0.75 | 3.45 | 6,361 |
| 12 | 0.77 | 3.58 | 6,369 |
| 13 | 0.78 | 3.66 | 6,546 |
| 14 | 0.84 | 3.78 | 6,672 |
| 15 | 0.79 | 3.82 | 6,890 |
| 16 | 0.70 | 3.97 | 7,115 |
| 17 | 0.68 | 4.07 | 7,327 |
| 18 | 0.72 | 4.25 | 7,546 |
| 19 | 0.55 | 4.41 | 7,931 |
| 20 | 0.63 | 4.49 | 8,097 |
| 21 | 0.56 | 4.70 | 8,468 |
| 22 | 0.41 | 4.58 | 8,717 |
| 23 | 0.51 | 4.69 | 8,991 |
| 24 | 0.47 | 4.71 | 9,179 |
| 25 | 0.32 | 4.78 | 9,318 |

Data taken from Spellman, L.J., “Entry and profitability in a rate-free savings and loan market.” Quarterly Review of Economics and Business, 18, no. 2 (1978): 87-95, Reprinted in Newbold, P. and Bos, T., Introductory Business & Economic Forecasting, 2nd Edition, Cincinnati (1994): 136-137

What is the relationship between the S&Ls’ profit margin percentage and the number of S&L offices? How about between the margin percentage and the net revenues per deposit dollar? Is the relationship positive (that is, profit margin percentage moves in the same direction as its independent variable(s))? Or negative (the dependent and independent variables move in opposite directions)? Let’s look at each independent variable’s individual relationship with the dependent variable.

Net Revenue Per Deposit Dollar (X1) and Percentage Profit Margin (Y)

Generally, if revenue per deposit dollar goes up, would we not expect the percentage profit margin to also go up? After all, if the S&L is making more revenue on the same dollar, it suggests more efficiency. Hence, we expect a positive relationship. So, in the resulting regression equation, we would expect the coefficient, β1, for net revenue per deposit dollar to have a “+” sign.

Number of S&L Offices (X2) and Percentage Profit Margin (Y)

Generally, if there are more S&L offices, would that not suggest either higher overhead, increased competition, or some combination of the two? Those would cut into profit margins. Hence, we expect a negative relationship. So, in the resulting regression equation, we would expect the coefficient, β2, for number of S&L offices to have a “-” sign.

Are our Expectations Correct?

Do our relationship expectations hold up?  They certainly do. The estimated multiple regression model is:

Yt = 1.56450 + 0.23720X1t – 0.000249X2t

What do the Parameter Estimates Mean?

Essentially, the model says that if net revenues per deposit dollar (X1t) increase by one unit, then percentage profit margin (Yt) will – on average – increase by 0.23720 percentage points, when the number of S&L offices is fixed. If the number of offices (X2t) increases by one, then percentage profit margin (Yt) will decrease by an average of 0.000249 percentage points, when net revenues are fixed.

Do Changes in the Independent Variables Explain Changes in The Dependent Variable?

We compute the coefficient of determination, R², and get 0.865, indicating that changes in the number of S&L offices and in the net revenue per deposit dollar explain 86.5% of the variation in S&L percentage profit margin.

Are the Parameter Estimates Statistically Significant?

We have 25 observations and three parameters – two coefficients for the independent variables and one intercept – hence we have 22 degrees of freedom (25 – 3). If we choose a 95% confidence level, we are saying that if we resampled and replicated this analysis 100 times, the confidence intervals we construct around our parameter estimates would contain the true parameters approximately 95 times. To test significance, we look at the t-values for each parameter estimate. For a two-tailed test at the 95% level with 22 degrees of freedom, our critical t-value is 2.074. That means that if the t-statistic for a parameter estimate is greater than 2.074, there is a statistically significant positive relationship between the independent variable and the dependent variable; if the t-statistic is less than -2.074, there is a statistically significant negative relationship. This is what we get:

| Parameter | Value | t-Statistic | Significant? |
|-----------|-----------|-------------|--------------|
| Intercept | 1.56450 | 19.70 | Yes |
| β1 | 0.23720 | 4.27 | Yes |
| β2 | -0.000249 | -7.77 | Yes |

So, yes, all our parameter estimates are significant.
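If you’d like to replicate these results yourself, here is a minimal sketch using Python’s statsmodels with the 25 observations from the table above (Excel, SAS, SPSS, or MINITAB would work just as well):

```python
# A sketch that reproduces the estimates, R-squared, and t-statistics above.
import pandas as pd
import statsmodels.api as sm

# (margin = Y, revenue_per_dollar = X1, offices = X2) -- 25 annual observations
data = [
    (0.75, 3.92, 7298), (0.71, 3.61, 6855), (0.66, 3.32, 6636),
    (0.61, 3.07, 6506), (0.70, 3.06, 6450), (0.72, 3.11, 6402),
    (0.77, 3.21, 6368), (0.74, 3.26, 6340), (0.90, 3.42, 6349),
    (0.82, 3.42, 6352), (0.75, 3.45, 6361), (0.77, 3.58, 6369),
    (0.78, 3.66, 6546), (0.84, 3.78, 6672), (0.79, 3.82, 6890),
    (0.70, 3.97, 7115), (0.68, 4.07, 7327), (0.72, 4.25, 7546),
    (0.55, 4.41, 7931), (0.63, 4.49, 8097), (0.56, 4.70, 8468),
    (0.41, 4.58, 8717), (0.51, 4.69, 8991), (0.47, 4.71, 9179),
    (0.32, 4.78, 9318),
]
df = pd.DataFrame(data, columns=["margin", "revenue_per_dollar", "offices"])

X = sm.add_constant(df[["revenue_per_dollar", "offices"]])
fit = sm.OLS(df["margin"], X).fit()

print(fit.params)      # approximately 1.56450, 0.23720, -0.000249
print(fit.rsquared)    # approximately 0.865
print(fit.tvalues)     # approximately 19.7, 4.27, -7.77
```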

Next Forecast Friday: Building on What You Learned

I think you’ve had enough for this week! But we are still not finished: we will continue with further analysis of this example next week, when we discuss computing the 95% confidence interval for the parameter estimates, determining whether the model is valid, and checking for autocorrelation. The following Forecast Friday post (July 1) will discuss specification bias in greater detail, demonstrating the impact of omitting a key independent variable from the model.