I have to submit a term paper
I have to submit a term paper which involves conducting a regression and correlation analysis on any topic of my choosing. The paper must be based on yearly data for any economic or business variable, for a period of at least 20 years. The following also must be included in the paper:
• The term paper should distinguish between dependent and independent variables; determine the regression equation by the least squares method; plot the regression line on a scatter diagram; interpret the meaning of regression coefficients; use the regression equation to predict values of the dependent variable for selected values of the independent variable and construct forecast intervals; and calculate the standard error of estimate, coefficients of determination (r²) and correlation (r), and interpret the meaning of the coefficients (r²) and (r).
Your regression and correlation analysis must:
1. Graph the data (scatter diagram)
2. Use the method of least squares to derive a trend equation and trend values
3. Use a check column to verify computations: ∑(Y − Yc) = 0
4. Superimpose the trend equation on the scatter diagram
5. Use your model to predict the movement of the variable for the next year
6. Compare your predictions with the actual behavior of the variable during the 21st year
I am stumped on what topic to choose and that is where I am looking for guidance. Any help would be greatly appreciated.
Answer:
A good way to start might be to model the relationship with a linear equation, Y = mX + b, where m is the slope of the line and b is the Y-intercept, as you learned in high school algebra. The question now is to determine the best way to estimate m and b given your pairs of X's and Y's.
It turns out that the best way to do this is to use least squares regression. We call it least squares regression because the line that we choose will be the one for which the sum of the squares of the differences between predicted and observed values is as small as possible.
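As a rough illustration of the least squares idea, here is a minimal Python sketch that computes m and b from paired data and then runs the "check column" from the question, ∑(Y − Yc) = 0. The function name `least_squares` and the five data points are made up for illustration; they are not the data behind the output discussed in this answer.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the least squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

# Hypothetical illustrative data (NOT from the original analysis)
calls = [10, 20, 30, 40, 50]
sales = [25, 24, 28, 27, 31]

m, b = least_squares(calls, sales)

# Check column: residuals of a least squares line sum to (numerically) zero
fitted = [m * x + b for x in calls]
residual_sum = sum(y - yc for y, yc in zip(sales, fitted))

print(f"slope = {m:.4f}, intercept = {b:.4f}")
print(f"sum of (Y - Yc) = {residual_sum:.10f}")
```

Because of floating point rounding, the check sum may print as a very small number rather than exactly zero.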
1. Scatter Plot of Y and X1
The scatter plot of sales and calls suggests that there may be a linear trend between the two. The trendline indicates that the higher the number of calls, the higher the sales tend to be.
2. Best fit line
Using the Regression option in the Excel Data Analysis menu, we obtain the following output
From this, the best fit line equation is Sales = Intercept + Coefficient of Calls * Calls
i.e., Sales = 22.52 + 0.1237 * Calls
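To show how the fitted equation is used for prediction, here is a small sketch that plugs in selected values of Calls. The intercept and slope are the ones quoted above; the function name `predict_sales` and the chosen call counts are only illustrative assumptions.

```python
INTERCEPT = 22.52  # intercept from the regression output quoted above
SLOPE = 0.1237     # coefficient of Calls from the same output

def predict_sales(calls):
    """Predicted sales from the fitted line Sales = 22.52 + 0.1237 * Calls."""
    return INTERCEPT + SLOPE * calls

# Hypothetical values of the independent variable, chosen for illustration
for calls in (50, 100, 150):
    print(f"calls = {calls}: predicted sales = {predict_sales(calls):.2f}")
```

For example, 100 calls gives 22.52 + 12.37 = 34.89. Predictions are most trustworthy for call counts inside the range that was actually observed.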
3. Coefficient of Correlation
It denotes the strength of association between two variables. The sign denotes the direction of association.
In Excel, we calculate the correlation coefficient as CORREL(X1array, Yarray)
We get the value as 0.318
This means that calls and sales are weakly positively associated. As one quantity increases, the other also tends to increase. Please note that this does not imply causation, i.e., we CANNOT say that a rise or fall in one is causing the change in the other.
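For readers working outside Excel, the same quantity can be sketched in plain Python. This is a hand-written Pearson correlation, not Excel's implementation, and the data are hypothetical, so the result will not equal the 0.318 reported above.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustrative data (NOT from the original analysis)
calls = [10, 20, 30, 40, 50]
sales = [25, 24, 28, 27, 31]

r = pearson_r(calls, sales)
print(f"r = {r:.3f}")
```

The value always lies between -1 and 1. Its sign matches the sign of the regression slope.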
4. Coefficient of Determination
It is more commonly known as the R squared value. It gives a measure of how close the data points are to the best fit line. In other words, it gives the proportion of variability in the dependent variable that can be explained by the independent variable. The higher the R squared value, the better the model fits the data.
From the Excel regression output, we get the R squared value or Coefficient of Determination as 0.101
~10% of variability in sales is explained by calls.
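With a single independent variable, R squared is simply the square of the correlation coefficient, which is why 0.318² ≈ 0.101. The sketch below computes R squared from its definition, 1 − SSE/SST, along with the standard error of estimate that the question asks for. The function name `fit_summary` and the data are hypothetical.

```python
import math

def fit_summary(xs, ys):
    """Return (r_squared, standard_error_of_estimate) for a simple linear fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    # SSE: unexplained variation; SST: total variation in y
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - sse / sst
    # Two parameters (slope, intercept) are estimated, hence n - 2
    std_err = math.sqrt(sse / (n - 2))
    return r_squared, std_err

# Hypothetical illustrative data (NOT from the original analysis)
calls = [10, 20, 30, 40, 50]
sales = [25, 24, 28, 27, 31]

r_squared, std_err = fit_summary(calls, sales)
print(f"R squared = {r_squared:.3f}, standard error of estimate = {std_err:.3f}")
```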
5. Utility of Regression model
F test can be used to test the utility of the model.
Null Hypothesis: Beta coefficient of calls = 0; i.e., Calls is NOT linearly associated with sales
Alternate Hypothesis: Beta coefficient of calls ≠ 0; Calls is linearly associated with sales
Let us choose significance level, α = 0.05.
From the regression ANOVA output, we get the p value (or significance value) of the F test as 0.0012 (< 0.05) for the given degrees of freedom (highlighted)
Since p value < α, we can reject the Null hypothesis, thereby concluding that with the given data it can be said that calls is linearly associated with sales.
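For a regression with one independent variable, the F statistic can be written in terms of R squared and the number of observations n as F = R² / ((1 − R²) / (n − 2)). The sketch below evaluates that formula. The sample size is not stated in this answer, so `N_OBS` is a hypothetical placeholder. Turning F into a p value would also need the F distribution, which is not computed here.

```python
def f_statistic(r_squared, n):
    """F statistic for a simple (one-predictor) linear regression."""
    df_regression = 1      # one independent variable
    df_residual = n - 2    # n observations minus two estimated parameters
    return (r_squared / df_regression) / ((1 - r_squared) / df_residual)

R_SQUARED = 0.101  # coefficient of determination quoted above
N_OBS = 100        # HYPOTHETICAL sample size; replace with the real one

f_value = f_statistic(R_SQUARED, N_OBS)
print(f"F = {f_value:.2f} with (1, {N_OBS - 2}) degrees of freedom")
```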
6. Based on the above findings, it can be said that calls is a statistically significant variable in predicting sales volume, although it explains only about 10% of the variability in sales. The data provide evidence that calls and sales have a positive linear association between them. From the best fit line (Sales = 22.52 + 0.1237 * Calls), we can say that with each additional call, predicted sales increase by 0.1237 units on average (interpretation of the coefficient of calls).
7. 95% Confidence Interval
The 95% confidence interval for the coefficient of Calls (β1) is [0.0498, 0.1976]
Interpretation: A 95% confidence interval means that if this regression analysis is repeated for other samples from the population, 95% of the intervals will contain the true value of β1. In simpler terms, we can say that we are 95% confident that the true value of β1 is in our interval…
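A confidence interval for a regression coefficient has the form estimate ± t × standard error, so it is symmetric around the point estimate. The sketch below checks that the reported interval is centered on 0.1237 and shows the general formula. The standard error and critical t value passed to `confidence_interval` are hypothetical placeholders, since neither is quoted in this answer.

```python
def confidence_interval(estimate, std_error, t_critical):
    """Two-sided interval: estimate +/- t_critical * std_error."""
    margin = t_critical * std_error
    return estimate - margin, estimate + margin

# Reported 95% interval for the coefficient of Calls
lower, upper = 0.0498, 0.1976
center = (lower + upper) / 2   # should match the slope estimate, 0.1237
margin = (upper - lower) / 2   # half-width of the interval
print(f"center = {center:.4f}, margin = {margin:.4f}")

# HYPOTHETICAL standard error and critical value, for illustration only
lo, hi = confidence_interval(0.1237, std_error=0.037, t_critical=2.0)
print(f"illustrative interval = [{lo:.4f}, {hi:.4f}]")
```

Since the reported interval does not contain 0, it agrees with the F test conclusion that the coefficient of calls differs from zero at the 5% level.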