Sunday, 30 September 2012

Theory revision 5

The multiple standard error of estimate is a measure of the effectiveness of the regression equation.
  • It is measured in the same units as the dependent variable.
  • It is difficult to determine what is a large value and what is a small value of the standard error.
  • The independent variables and the dependent variable have a linear relationship.
  • The dependent variable must be continuous and at least interval scale.
  • The variation in (Y-Y') or residual must be the same for all values of Y. When this is the case, we say the difference exhibits homoscedasticity.
  • A residual is the difference between the actual value of Y and the predicted value Y'.
  • Residuals should be approximately normally distributed. Histograms and stem-and-leaf charts are useful in checking this requirement.
  • The residuals should be normally distributed with mean 0.
  • Successive values of the dependent variable must be uncorrelated.

The ANOVA table 

  • The ANOVA table gives the variation in the dependent variable (both the part that is explained by the regression equation and the part that is not).
  • It is used as a statistical technique or test for detecting differences in population means, that is, whether or not the means of different groups are all equal when you have more than two populations.

Correlation Matrix 

  • A correlation matrix is used to show all possible simple correlation coefficients between all variables.
    • The matrix is useful for locating correlated independent variables.
    • How strongly each independent variable is correlated to the dependent variable is shown in the matrix.

Global Test 

  • The global test is used to investigate whether any of the independent variables have significant coefficients. 
  • The test statistic follows the F distribution with k (number of independent variables) and n-(k+1) degrees of freedom, where n is the sample size.

Test for individual variables

  • This test is used to determine which independent variables have nonzero regression coefficients.
  • The variables whose regression coefficients are not significantly different from zero are usually dropped from the analysis.
  • The test statistic follows the t distribution with n-(k+1) degrees of freedom.

Qualitative Variables and Stepwise Regression

  • Qualitative variables are non-numeric and are also called dummy variables.
    • For a qualitative variable, there are only two conditions possible, usually coded 0 and 1.
  • Stepwise regression leads to the most efficient regression equation.
    • Only independent variables with significant regression coefficients are entered into the analysis. Variables are entered in the order in which they increase R^2 the fastest.

Theory revision 4

Hypothesis Testing

  • Hypothesis testing is an essential part of statistical inference.
  • A statistical hypothesis is an assumption about a population. This assumption may or may not be true.
  • For example, claiming that a new drug is better than the current drug for treatment of the same symptoms.
  • The best way to determine whether a statistical hypothesis is true would be to examine the entire population.
  • Since that is often impractical, researchers examine a random sample from the population. If the sample data are consistent with the statistical hypothesis, the hypothesis is accepted; if not, it is rejected.
  • Null Hypothesis: The null hypothesis, denoted by H0, is usually the hypothesis that sample observations result purely from chance.
  • Alternative Hypothesis: The alternative hypothesis, denoted by H1, is the hypothesis that sample observations are influenced by some non-random cause.
  • Statisticians follow a formal process to determine whether to accept or reject a null hypothesis based on sample data. This process, called hypothesis testing, consists of five steps.
    1. State the null and alternative hypotheses.
    2. Write down the relevant data and select a level of significance.
    3. Identify and compute the test statistic, Z, to be used in testing the hypothesis.
    4. Compute the critical values, Zc.
    5. Based on the sample, arrive at a decision.
  • Decision errors
    • Type I error: A Type I error occurs when the null hypothesis is rejected when it is true. The probability of committing a Type I error is called the significance level and is denoted by alpha (α).
    • Type II error: A Type II error occurs when the researcher accepts a null hypothesis that is false.
  • One-tailed and two-tailed tests
    • A test of a statistical hypothesis where the region of rejection is on only one side of the sampling distribution is called a one-tailed test.
    • For example, suppose the null hypothesis states that the mean is equal to or more than 10. The alternative hypothesis would be that the mean is less than 10.
    • The region of rejection would consist of a range of numbers located on the left side of the sampling distribution, that is, a set of numbers less than 10.
    • A test of a statistical hypothesis where the region of rejection is on both sides of the sampling distribution is called a two-tailed test.
    • For example, suppose the null hypothesis states that the mean is equal to 10. The alternative hypothesis would be that the mean is less than 10 or greater than 10.
    • The region of rejection would consist of a range of numbers located on both sides of the sampling distribution; that is, the region of rejection would consist partly of numbers that were less than 10 and partly of numbers that were greater than 10.
  • Degrees of freedom
    • The concept of degrees of freedom is central to the principle of estimating statistics of populations from samples of them.
    • It is the number of scores that are free to vary.
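The five steps and the one-tailed example above can be sketched in Python. This is a minimal illustration, not a definitive procedure: the function name, the sample figures, and the assumption that the population standard deviation is known are all hypothetical, and the critical value -1.645 is the standard left-tail Z value for a 0.05 significance level.

```python
import math

def z_test_left_tailed(sample_mean, mu0, sigma, n, z_critical=-1.645):
    """One-sample, left-tailed Z test of H0: mean >= mu0 against H1: mean < mu0."""
    # Step 3: compute the test statistic Z
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # Steps 4 and 5: compare Z with the critical value Zc and decide
    reject_h0 = z < z_critical
    return z, reject_h0

# Hypothetical data: sample mean 9.4 from n = 36 observations, sigma = 1.8
z, reject = z_test_left_tailed(9.4, 10, 1.8, 36)
print(round(z, 2), reject)
```

Here Z falls in the left-hand rejection region, so under these assumed numbers the null hypothesis would be rejected.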

Theory revision 3

Bernoulli Trials

  • A random experiment with two outcomes (success or failure)
  • The random variable X is often coded as 0 (failure) and 1 (success).
  • A Bernoulli trial has a probability of success, usually denoted p.
  • Accordingly, the probability of failure (1-p) is usually denoted q:
    • q=1-p
  • The probability function of the Bernoulli distribution is:
    • P(X=x) = p^x (1-p)^(1-x)
    • where x can be zero or one.
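As a small sketch of the Bernoulli probability function p^x (1-p)^(1-x), the following Python function returns p for a success and q = 1-p for a failure. The function name and the value p = 0.3 are assumptions chosen for illustration.

```python
def bernoulli_pmf(x, p):
    """Probability that a Bernoulli trial with success probability p yields x."""
    if x not in (0, 1):
        raise ValueError("x must be 0 (failure) or 1 (success)")
    return p ** x * (1 - p) ** (1 - x)

# Hypothetical success probability
p = 0.3
print(bernoulli_pmf(1, p), bernoulli_pmf(0, p))
```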

Binomial distribution

  1. A fixed number of identical trials.
  2. The binomial distribution consists of a fixed number of statistically independent Bernoulli trials.
  3. Two possible outcomes for each trial (success or failure).
  4. Each trial is independent (it does not affect the others).
  5. The probability of success is the same for each trial.
  6. Shapes of the binomial distribution
    • if p<0.5: the distribution will exhibit positive skew
    • if p=0.5: the distribution will be symmetric
    • if p>0.5: the distribution will exhibit negative skew
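The shape rule above can be checked numerically. The sketch below uses the standard binomial probability function and the standard skewness expression (1-2p)/sqrt(np(1-p)); the function names and the choice n = 10 are assumptions made for the example.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def binomial_skewness(n, p):
    """Skewness of the binomial distribution: positive, zero, or negative."""
    return (1 - 2 * p) / math.sqrt(n * p * (1 - p))

# Hypothetical n; the three p values cover the three shape cases
for p in (0.2, 0.5, 0.8):
    print(p, round(binomial_skewness(10, p), 3))
```

The sign of the skewness follows the list: positive for p<0.5, zero for p=0.5, negative for p>0.5.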

Poisson Random Variable 

  • A Poisson random variable represents the number of independent events that occur randomly over a unit of time.
  • It counts the number of times an event occurs during a given unit of measurement.
  • The number of events that occur in one unit is independent of the other units.
  • The probability that an event occurs over a given unit is identical for all units (constant rate).
  • Events occur randomly.
  • The expected number of events (rate) in each unit is denoted by λ (lambda).
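The notes above do not state the Poisson probability function, so the sketch below relies on the standard one, P(X=k) = e^(-λ) λ^k / k!, as an outside fact. The function name and the rate λ = 2 are assumptions for illustration.

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k events in a unit when the rate is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Hypothetical rate: on average 2 events per unit
lam = 2.0
probabilities = [poisson_pmf(k, lam) for k in range(5)]
print([round(prob, 4) for prob in probabilities])
```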

Saturday, 15 September 2012

Residual Plot


Analysis of Residuals

  • A residual is the difference between the actual value of Y and the predicted value Y'.
  • Residuals should be approximately normally distributed. Histograms and stem-and-leaf charts are useful in checking this requirement.
  • A plot of the residuals and their corresponding Y' values is used for showing that there are no trends or patterns in the residuals.
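A minimal sketch of how the residuals and the points of a residual plot could be computed. The data values and names are hypothetical; the plotting itself is left out so the example needs no external library.

```python
def residuals(actual, predicted):
    """Return the residuals Y - Y' for paired actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return [y - y_hat for y, y_hat in zip(actual, predicted)]

# Hypothetical values, for illustration only
actual = [3.0, 5.0, 7.0, 10.0]
predicted = [2.5, 5.5, 7.0, 10.0]
res = residuals(actual, predicted)

# Pairs (Y', residual) are what a residual plot would display
plot_points = list(zip(predicted, res))
print(plot_points)
```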

Qualitative Variables & Stepwise Regression

  • Qualitative variables are non-numeric and are also called dummy variables.
    • For a qualitative variable, there are only two conditions possible, usually coded 0 and 1.
  • Stepwise regression leads to the most efficient regression equation.
    • Only independent variables with significant regression coefficients are entered into the analysis. Variables are entered in the order in which they increase R^2 the fastest.

Test for individual variables

  • This test is used to determine which independent variables have nonzero regression coefficients.
  • The variables whose regression coefficients are not significantly different from zero are usually dropped from the analysis.
  • The test statistic follows the t distribution with n-(k+1) degrees of freedom.
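The notes do not give the formula for this test. A common form, assumed here, divides the estimated coefficient by its standard error and compares the result with a critical t value taken from a table. All numbers below are hypothetical; 2.086 is the two-tailed 0.05 critical value for 20 degrees of freedom (for example n = 23 and k = 2).

```python
def coefficient_t_statistic(b, standard_error):
    """t statistic for H0: the regression coefficient is zero (b / s_b)."""
    return b / standard_error

def is_significant(t, t_critical):
    """Two-tailed decision: reject H0 when |t| exceeds the critical value."""
    return abs(t) > t_critical

# Hypothetical coefficient and standard error
t = coefficient_t_statistic(2.5, 1.0)
keep_variable = is_significant(t, 2.086)
print(t, keep_variable)
```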

Global Test

  • The global test is used to investigate whether any independent variables have significant coefficients. The hypotheses are:
    • H0: β1 = β2 = ... = βk = 0
    • H1: not all the βs equal 0
  • The test statistic follows the F distribution with k (number of independent variables) and n-(k+1) degrees of freedom, where n is the sample size.
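As a sketch of the global test statistic, the function below uses the usual ANOVA form F = (SSR/k) / (SSE/(n-(k+1))), where SSR is the variation explained by the regression and SSE is the unexplained variation. The function name and the sums of squares are hypothetical.

```python
def global_f_statistic(ssr, sse, n, k):
    """F statistic with k and n-(k+1) degrees of freedom."""
    df_regression = k
    df_error = n - (k + 1)
    if df_error <= 0:
        raise ValueError("need more observations than coefficients")
    return (ssr / df_regression) / (sse / df_error)

# Hypothetical sums of squares for n = 23 observations and k = 2 variables
f_value = global_f_statistic(ssr=120.0, sse=60.0, n=23, k=2)
print(f_value)
```

The computed F would then be compared with the critical F value for 2 and 20 degrees of freedom at the chosen significance level.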

Correlation Matrix

  • A correlation matrix is used to show all possible simple correlation coefficients between all variables.
    • The matrix is useful for locating correlated independent variables.
    • How strongly each independent variable is correlated to the dependent variable is shown in the matrix.

The ANOVA table

  • The ANOVA table gives the variation in the dependent variable (both the part that is explained by the regression equation and the part that is not).
  • It is used as a statistical technique or test in detecting the differences in population means or whether or not the means of different groups are all equal when you have more than two populations.

Multiple Regression and Correlation

  • The independent variables and the dependent variable have a linear relationship.
  • The dependent variable must be continuous and at least interval-scale.
  • The variation in (Y-Y') or residual must be the same for all values of Y. When this is the case, we say the difference exhibits homoscedasticity.
  • The residuals should be normally distributed with mean 0.
  • Successive values of the dependent variable must be uncorrelated.

Multiple Standard Error of Estimate

  • The multiple standard error of estimate is a measure of the effectiveness of the regression equation.
  • It is measured in the same units as the dependent variable.
  • It is difficult to determine what is a large value and what is a small value of the standard error.
  • The formula is
    • s = sqrt( Σ(Y-Y')^2 / (n-(k+1)) )
    • where n is the number of observations and k is the number of independent variables.
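A minimal sketch of the multiple standard error of estimate, computed as the square root of the sum of squared residuals divided by n-(k+1). The data and the function name are hypothetical.

```python
import math

def multiple_standard_error(actual, predicted, k):
    """sqrt( sum((Y - Y')^2) / (n - (k + 1)) ) for k independent variables."""
    n = len(actual)
    df = n - (k + 1)
    if df <= 0:
        raise ValueError("need more observations than coefficients")
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    return math.sqrt(sse / df)

# Hypothetical actual and predicted values from a model with k = 2
actual = [10, 12, 15, 19, 24]
predicted = [11, 12, 14, 20, 23]
s = multiple_standard_error(actual, predicted, k=2)
print(round(s, 4))
```

The result is expressed in the same units as Y, which is why it cannot be judged large or small without knowing the scale of the dependent variable.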

Multiple Regression Analysis

  • For two independent variables, the general form of the multiple regression equation is
    • Y'=a+b1X1+b2X2
  • X1 and X2 are the independent variables.
  • a is the Y intercept
  • b1 is the net change in Y for each unit change in X1, holding X2 constant. It is called a partial regression coefficient, a net regression coefficient, or just a regression coefficient.
  • The general multiple regression equation with k independent variables is given by:
    • Y'=a+b1X1+b2X2+...+bkXk
  • The least squares criterion is used to develop this equation.
  • Calculating b1, b2, etc. by hand is very tedious; many computer software packages can be used to estimate these parameters.
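To show what such software does for the two-variable case, here is a small pure-Python sketch that applies the least squares criterion by solving the normal equations with Gauss-Jordan elimination. It is an illustration under assumptions, not how any particular package works: the function name and data are hypothetical, and the data were built to follow Y = 1 + 2X1 + 3X2 exactly so the answer is easy to check.

```python
def fit_two_predictors(x1, x2, y):
    """Least squares estimates (a, b1, b2) for Y' = a + b1*X1 + b2*X2."""
    rows = [[1.0, u, v] for u, v in zip(x1, x2)]
    # Normal equations: (X^T X) beta = X^T y, stored as an augmented matrix
    aug = []
    for i in range(3):
        left = [sum(r[i] * r[j] for r in rows) for j in range(3)]
        right = sum(r[i] * yi for r, yi in zip(rows, y))
        aug.append(left + [right])
    # Gauss-Jordan elimination with partial pivoting
    # (fails with ZeroDivisionError if X1 and X2 are perfectly collinear)
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(3):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return tuple(aug[i][3] / aug[i][i] for i in range(3))

# Hypothetical data generated from Y = 1 + 2*X1 + 3*X2
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [9, 8, 19, 18, 29]
a, b1, b2 = fit_two_predictors(x1, x2, y)
print(round(a, 6), round(b1, 6), round(b2, 6))
```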

Wednesday, 12 September 2012

Regression Analysis

  • Purpose: to determine the regression equation; it is used to predict the value of the dependent variable (Y) based on the independent variable (X). 
  • Procedure: select a sample from the population and list the paired data for each observation; draw a scatter diagram to give a visual portrayal of the relationship; determine the regression equation.
  • The regression line 
    • A straight line that represents the relationship between two variables 
    • Useful to add to the scatterplot to help us see the direction of the relationship 
    • But it’s much more than this… 
  • Prediction
    • The regression line enables us to predict variable Y on the basis of variable X.
  • b
    • The slope of the regression line
    • The amount of change in Y associated with a one-unit change in X
  • a
    • The intercept
    • The point where the regression line crosses the Y axis
    • The predicted value of Y when X = 0
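A short sketch of how the slope and intercept of the regression line can be obtained by least squares and then used for prediction. The formulas are the standard ones (slope = Σ(X-mean X)(Y-mean Y) / Σ(X-mean X)^2, intercept = mean Y - slope × mean X); the names and the data are hypothetical.

```python
def fit_line(x, y):
    """Return (a, b) for the regression line Y' = a + b*X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx             # slope: change in Y per one-unit change in X
    a = mean_y - b * mean_x   # intercept: predicted Y when X = 0
    return a, b

# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = fit_line(x, y)
prediction = a + b * 6  # predicted Y for X = 6
print(round(a, 2), round(b, 2), round(prediction, 2))
```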

Rank Order Correlation

  • The correlation coefficient (r) is based on the assumption that the data are normally distributed.
  • If the data are skewed, instead of r we use a rank-order correlation rs (also called the Spearman coefficient), as follows:
    • rs = 1 - 6Σd^2 / (N(N^2-1))
  • where d = the difference between the ranks of each pair
  • N = the number of pairs of observations
  • The Spearman coefficient can assume any value from -1.00 to +1.00 inclusive.
  • -1.00 indicates a perfect negative correlation
  • +1.00 indicates a perfect positive correlation
  • Ranks must be assigned to both values in each pair of data.
  • The highest data value may be given the lowest rank, or the other way round.
  • The lowest data value then gets the highest rank, or the other way round; the same convention must be used for both variables.
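The rank-order calculation can be sketched as follows, using rs = 1 - 6Σd^2 / (N(N^2-1)). The sketch assumes there are no tied values, gives rank 1 to the lowest value in both variables, and uses hypothetical data and function names.

```python
def ranks(values):
    """Rank values from 1 (lowest) to N (highest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rs(x, y):
    """Spearman rank-order correlation for paired data without ties."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical paired observations
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]
rs = spearman_rs(x, y)
print(round(rs, 2))
```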

t value

  • t = r sqrt(n-2) / sqrt(1-r^2)
  • with (n-2) degrees of freedom
  • df = number of points on the scatter plot minus 2
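A brief sketch of the t value for testing a correlation coefficient, assuming the usual form t = r sqrt(n-2) / sqrt(1-r^2) with n-2 degrees of freedom. The function name and the values r = 0.6 and n = 18 are hypothetical.

```python
import math

def correlation_t_value(r, n):
    """Return (t, df) for testing whether a correlation r differs from zero."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
    return t, df

# Hypothetical correlation from 18 points on a scatter plot
t, df = correlation_t_value(0.6, 18)
print(round(t, 2), df)
```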

Formula for correlation coefficient (r)


Strong Positive Correlation


Zero Correlation


Perfect Positive Correlation


Perfect Negative Correlation


Sunday, 9 September 2012

The coefficient of correlation, r

  • The Coefficient of Correlation (r) is a measure of the strength of the relationship between two variables. It requires interval or ratio-scaled data (variables). 
  • It can range from -1.00 to 1.00. 
    • Values of -1.00 or 1.00 indicate perfect correlation.
    • Values close to 0.0 indicate weak correlation.
    • Negative values indicate an inverse relationship and positive values indicate a direct relationship.
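The range and sign behaviour described above can be seen in a small sketch that computes r as Σ(X-mean X)(Y-mean Y) divided by sqrt(Σ(X-mean X)^2 Σ(Y-mean Y)^2). This is one of several equivalent forms of the formula; the function name and data are hypothetical.

```python
import math

def correlation_coefficient(x, y):
    """Pearson coefficient of correlation r for paired data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: a direct, an inverse, and a weaker direct relationship
x = [1, 2, 3, 4, 5]
r_direct = correlation_coefficient(x, [2, 4, 6, 8, 10])
r_inverse = correlation_coefficient(x, [10, 8, 6, 4, 2])
r_partial = correlation_coefficient(x, [2, 4, 5, 4, 5])
print(round(r_direct, 2), round(r_inverse, 2), round(r_partial, 2))
```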

Correlation Analysis

  • Correlation Analysis: A group of statistical techniques used to measure the strength of the relationship (correlation) between two variables.
  • Scatter Diagram: A chart that portrays the relationship between the two variables of interest.
  • Dependent Variable: The variable that is being predicted or estimated. 
  • Independent Variable: The variable that provides the basis for estimation. It is the predictor variable.

Saturday, 1 September 2012

The Chi Square Distribution

  • The chi-square distribution is asymmetric and its values are never negative.
  • Degrees of freedom are based on the table and are calculated as (rows-1) x (columns-1), or just (rows-1) when the table has a single column of categories.

What your table should look like


Chi-square Hypothesis Testing

The major characteristics of the chi-square distribution are:
  • It is positively skewed
  • It is non-negative
  • It is based on degrees of freedom
  • When the degrees of freedom change a new distribution is created
Hypothesis testing consists of five steps.
  • State the null and alternative hypotheses.
  • Write down the relevant data and select a level of significance.
  • Identify and compute the test statistic χ^2 to be used in testing the hypothesis.
  • Compute the critical value of χ^2.
  • Based on the sample, arrive at a decision.
  • Use the following formula for the test statistic χ^2:
    • χ^2 = Σ (fo-fe)^2 / fe
  • where fo is the observed frequency and fe is the expected frequency.
  • Then the critical values are determined.
  • The chi-square distribution is a family of distributions, each with a different shape depending on the degrees of freedom, df.
  • The data are often presented in a table format. If starting with raw data on two variables, a table must be created first.
  • Columns are scores of the independent variable.
    • There will be as many columns as there are scores in the independent variable.
  • Rows are scores of the dependent variable.
    • There will be as many rows as there are scores on the dependent variable.
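The chi-square calculation for a table can be sketched as below. The sketch assumes the usual rule for expected frequencies in a contingency table (row total × column total / grand total), which the notes above do not state. The observed counts and the function name are hypothetical; 3.841 is the 0.05 critical value for 1 degree of freedom.

```python
def chi_square_statistic(observed):
    """Return (chi-square, df) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, fo in enumerate(row):
            # Expected frequency for the cell in row i, column j
            fe = row_totals[i] * col_totals[j] / total
            chi2 += (fo - fe) ** 2 / fe
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

# Hypothetical 2 x 2 table: rows = dependent variable, columns = independent
observed = [[20, 30],
            [30, 20]]
chi2, df = chi_square_statistic(observed)
reject_h0 = chi2 > 3.841  # critical value for df = 1 at the 0.05 level
print(chi2, df, reject_h0)
```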