Lessons for Statistics, Data Analysis, and Research

Data Hub

Member
Feb 5, 2021
Nowadays, the world is driven by data, and high-quality data is a must for development impact. It is the foundation for meaningful policy-making, efficient resource allocation, and effective public service delivery. Valid data analysis depends on high-quality data, and valid analysis in turn leads to useful results from research and project evaluation. Unfortunately, data analysis is often undermined by poor choices of statistical tests and by poor implementation of those tests.
As a result, my team at Dataexpert Statistical Firm and I decided to organize these sessions to assist those without a statistical background, as well as junior data scientists, in understanding the most perplexing aspects of statistics, data analysis, research, and project evaluation.

These will include the following topics:
  • The fundamental statistical tests
  • When to use which test
  • Research writing
  • Project monitoring & evaluation
We categorize this training into chapters as follows:

CHAPTER ONE: Statistical Testing

By passing through the following key points, we will discuss how to select the best test for your data.

  1. Terminologies
  2. Statistical Test (Hypothesis Testing)
  3. Statistical Assumptions
  4. Parametric tests
  5. Flowchart of Parametric Testing
  6. Dealing with non-normal distributions (non-parametric tests)

1. TERMINOLOGIES

Independent and Dependent Variables


An independent variable (predictor variable) is one that is manipulated or controlled to observe its effect on a dependent variable.

A dependent variable (outcome/output variable) is, as the name implies, one whose value is determined by other variables in the study. A change in the independent variable has an effect on it.

Examples of independent and dependent variables in a hypothesis:

Example 1. A greater number of coal plants in a region (independent variable) increases water pollution (dependent variable).

Example 2. What effect does diet or regular soda (independent variable) have on blood sugar levels (dependent variable)?

If you change the independent variable (the type of soda you consume), it will change the dependent variable (blood sugar).

TYPES OF VARIABLES

It is important to distinguish the differences between the types of variables because this plays a key role in determining the correct type of statistical test to adopt.

There are two main categories:
  1. QUANTITATIVE: express amounts of things (e.g. the number of cigarettes in a pack).
The two types of quantitative variables are:
  • CONTINUOUS (also called ratio variables): describe measurements and can usually be divided into units smaller than one (e.g. 1.50 kg).
  • DISCRETE (also called integer variables): describe counts and usually can’t be divided into units smaller than one (e.g. 1 cigarette).
  2. CATEGORICAL: express groupings of things (e.g. the different types of fruit).

The three types of categorical variables are:
  • ORDINAL: represent data with an order (e.g. rankings).
  • NOMINAL: represent group names (e.g. brands or species names).
  • BINARY: represent data with a yes/no or 1/0 outcome (e.g. left or right).

FLOWCHART: INDICATING A TYPE OF VARIABLE

This flowchart gives a summary of the types of variables and their measurements.

1639842741029.png

2. STATISTICAL TESTS

Statistics is all about data. Data on its own is uninteresting; we are interested in the interpretation of the data.

One of the most important aspects of statistics is statistical testing. If statistics "is the interpretation of the data," statistical testing can be considered as the "formal procedure for investigating our ideas about the world".

In other words, whenever we want to make claims about the distribution of data, or about whether one set of results is different from another set of results, we must rely on hypothesis testing.

HYPOTHESIS TESTING

Using hypothesis testing, we try to interpret or draw conclusions about the population using sample data, evaluating two mutually exclusive statements about a population to determine which statement is best supported by the sample data.

A hypothesis is a statement that introduces a research question and proposes an expected result. To formulate a promising research hypothesis, you should ask yourself the following questions:
  1. Is the language clear and focused?
  2. What is the relationship between your hypothesis and your research topic?
  3. Is your hypothesis testable?
  4. Does your hypothesis include both an independent and dependent variable?

The questions listed above can be used as a checklist to make sure your hypothesis is based on a solid foundation. They can also assist you in identifying flaws in your hypothesis and revising it as needed.

Steps for Effective Hypothesis Testing:

Step 1. State your null (Ho) and alternate (Ha) hypotheses.

Following the development of your initial research hypothesis (the prediction that you want to investigate), it is critical to restate it as a null (Ho) and alternate (Ha) hypothesis so that it can be mathematically tested.

The null hypothesis predicts that there is no relationship between the variables of interest. The alternate hypothesis is usually your initial research hypothesis: it predicts a relationship between the variables.

You want to see if there is a relationship between gender and height. You form a hypothesis based on your knowledge of human physiology that men are, on average, taller than women. To put this hypothesis to the test, rephrase it as follows:

Ho: Men are, on average, not taller than women.
Ha: Men are, on average, taller than women.
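
The same pair of hypotheses can be written in symbols. The notation below is an assumption introduced here for illustration (it is not in the original post): \mu_M and \mu_W stand for the population mean heights of men and women.

```latex
H_0:\ \mu_M \le \mu_W
\qquad \text{versus} \qquad
H_a:\ \mu_M > \mu_W
```

Written this way, it is clear that the two statements are mutually exclusive and that the test is one-sided.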

Step 2. Collect data in such a way that it can be used to test the hypothesis.

For a statistical test to be valid, sampling and data collection must be done in a way that is designed to test your hypothesis. You cannot make statistical inferences about the population you are interested in if your data is not representative.

To test differences in average height between men and women, your sample should include an equal proportion of men and women, as well as a range of socioeconomic classes and other variables that may influence average height.

Step 3. Apply an appropriate statistical test: compute the p-value and compare it to the level of significance.

There are a variety of statistical tests available, but many of them are based on comparing within-group variance (how spread out the data is within a category) with between-group variance (how different the categories are from one another).

If the between-group variance is large enough that there is little or no overlap between groups, then your statistical test will reflect that by showing a low p-value. This means that a difference this large would be unlikely to arise by chance alone if the null hypothesis were true.

Alternatively, if there is high within-group variance and low between-group variance, then your statistical test will reflect that with a high p-value. This means that the difference you measured between groups could plausibly be due to chance.
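
The idea of comparing the two kinds of variance can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the heights are invented, the function name `variance_components` is hypothetical, and the ratio computed is the F ratio used in one-way ANOVA, chosen here as one concrete way to express "between versus within".

```python
from statistics import mean

# Hypothetical heights in cm, invented purely for illustration.
groups = {
    "men": [172.0, 176.0, 178.0, 174.0],
    "women": [160.0, 163.0, 165.0, 160.0],
}

def variance_components(groups):
    """Return (between-group mean square, within-group mean square)."""
    values = [v for g in groups.values() for v in g]
    grand_mean = mean(values)
    k = len(groups)   # number of groups
    n = len(values)   # total number of observations
    # How far each group mean sits from the overall mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # How far each observation sits from its own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    return ss_between / (k - 1), ss_within / (n - k)

ms_between, ms_within = variance_components(groups)
f_ratio = ms_between / ms_within
print(f"between={ms_between:.2f} within={ms_within:.2f} ratio={f_ratio:.2f}")
```

A large ratio means the groups differ much more from each other than the observations differ within each group, which is the situation that produces a low p-value.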

Your choice of statistical test will be based on the type of data you have collected. In our example there are two variables: gender (categorical, with two groups) and height (quantitative). We therefore perform a t-test to test whether men are in fact taller than women.

This test gives us:
  • An estimate of the difference in average height between the two groups.
  • A p-value, which shows how likely you would be to see a difference at least this large if the null hypothesis of no difference were true.

The result from the t-test shows an average height of 175.4 cm for men and an average height of 161.7 cm for women, an estimated difference of 13.7 cm. The p-value is 0.002.
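
To make the mechanics concrete, here is a minimal sketch of how the test statistic is computed. Several things are assumptions: the sample values are invented (they are not the data behind the 175.4 cm and 161.7 cm figures), the function name `welch_t` is hypothetical, and the variant shown is Welch's t-test, which does not require equal variances, rather than whichever t-test variant was actually used above.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return (difference in means, Welch t statistic, approximate degrees of freedom)."""
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)   # squared standard error of mean(b)
    diff = mean(a) - mean(b)
    t = diff / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return diff, t, df

# Hypothetical heights in cm, invented purely for illustration.
men = [172.0, 176.0, 178.0, 174.0]
women = [160.0, 163.0, 165.0, 160.0]

diff, t, df = welch_t(men, women)
print(f"difference={diff:.1f} cm, t={t:.2f}, df={df:.1f}")
```

Turning the t statistic into a p-value requires the t distribution, which the Python standard library does not provide. In practice a statistics package does this step; for example, SciPy's `scipy.stats.ttest_ind(men, women, equal_var=False)` returns both the statistic and the p-value.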

Step 4. Decide whether to “Reject” the null hypothesis (Ho) or “Fail to reject” the null hypothesis (Ho).

We must decide whether to reject or fail to reject the null hypothesis based on the results of the statistical test.

In most cases, we base our decision on the p-value generated by the statistical test. And, in most cases, our cutoff for rejecting the null hypothesis will be 0.05: we reject when there is a less than 5% chance that results like these would be observed if the null hypothesis were true.

In our analysis of the difference in average height between men and women, the p-value of 0.002 is less than our cutoff of 0.05, so we reject the null hypothesis of no difference.
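
The decision rule in this step is simple enough to write down directly. The function name `decide` is hypothetical; the default cutoff of 0.05 is the one used in the text.

```python
def decide(p_value, alpha=0.05):
    """Apply the reject / fail-to-reject rule at significance level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    return "reject H0" if p_value < alpha else "fail to reject H0"

print(decide(0.002))  # the p-value from the height example
```

Note that the rule never returns "accept H0": a high p-value means only that the data do not give enough evidence against the null hypothesis.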

Step 5. Present your findings.

The results of hypothesis testing will be presented in the results and discussion sections of your research paper.

In the results section, you should give a brief summary of the data and a summary of the results of your statistical test (for example, the estimated difference between group means and associated p-value). In the discussion, you can discuss whether your initial hypothesis was supported by your results or not.

In the formal language of hypothesis testing, we talk about rejecting or failing to reject the null hypothesis. You will probably be asked to do this in your statistics assignments.

Stating results in a statistical assignment
In our comparison of mean height between men and women, we found an average difference of 13.7 cm and a p-value of 0.002. Therefore, we can reject the null hypothesis that men are not taller than women and conclude that there is likely a difference in height between men and women.

However, when presenting research results in academic papers, we rarely talk this way. Instead, we go back to our alternate hypothesis (in this case, the hypothesis that men are on average taller than women) and state whether the result of our test was consistent or inconsistent with the alternate hypothesis.

3. STATISTICAL ASSUMPTIONS

Statistical tests make some common assumptions about the data being tested. If these assumptions are violated, the test may not be valid (e.g. the resulting p-value may not be correct).
  • Independence of observations: The observations are not related to one another. For example, measurements taken on different farms are independent of each other, whereas repeated measurements taken on the same farm are not. It also implies that each observation in a sample is counted only once.
  • Homogeneity of variance: The variance within each group being compared should be similar to the variance within the other groups. If one group has a much bigger variance than the other(s), this will limit the test’s effectiveness.
  • Normality of data: The data follow a normal distribution, that is, a symmetric, bell-shaped curve. The standard normal distribution, with mean 0 and standard deviation 1, is one special case; the data themselves do not need to have mean 0 and standard deviation 1.
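
Two of these ideas can be illustrated with the Python standard library. This is a rough sketch, not a formal test: the data are invented, the function names are hypothetical, and comparing the largest and smallest group variances is only an informal screening device (formal tests for equal variances exist in statistics packages).

```python
from statistics import mean, stdev, variance

def variance_ratio(*groups):
    """Ratio of the largest to the smallest sample variance across groups."""
    variances = [variance(g) for g in groups]
    return max(variances) / min(variances)

def standardize(values):
    """Convert values to z-scores: subtract the mean, divide by the standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical yields from two plots, invented purely for illustration.
plot_a = [1.0, 2.0, 3.0, 4.0, 5.0]
plot_b = [2.0, 4.0, 6.0, 8.0, 10.0]

ratio = variance_ratio(plot_a, plot_b)
z_scores = standardize(plot_b)
print(f"variance ratio = {ratio:.1f}")
print(f"z-score mean = {mean(z_scores):.2f}, z-score sd = {stdev(z_scores):.2f}")
```

The second function shows where "mean 0, standard deviation 1" comes from: any data set can be rescaled to have those values, so they describe standardized data rather than a requirement on the raw data.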
4. PARAMETRIC TESTS

Parametric tests should only be run on data that satisfy the three statistical assumptions described above. The most common parametric tests fall into three categories.

Note: Here we give only a brief overview of each category. Each one will be explained in detail in a later session.

Category 1: Regression tests

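These tests are used to examine cause-and-effect relationships: they check whether a change in one or more predictor variables predicts a change in another variable. Whether the relationship can be read as causal also depends on how the study was designed.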

Types of regression tests

1. Simple linear regression: Tests how a change in one predictor variable predicts the level of change in the outcome variable.

2. Multiple linear regression: Tests how changes in a combination of two or more predictor variables predict the level of change in the outcome variable.

3. Logistic regression: Describes and explains the relationship between one binary dependent variable and one or more nominal, ordinal, interval, or ratio-level independent variables.
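
As a small illustration of the first type, the least-squares slope and intercept of a simple linear regression can be computed by hand. The data and the function name `fit_line` are assumptions made for this sketch; a full regression analysis would also report standard errors and p-values, which a statistics package provides.

```python
from statistics import mean

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical predictor and outcome values, invented purely for illustration.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
score = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(hours, score)
print(f"score = {intercept:.2f} + {slope:.2f} * hours")
```

The slope is the predicted change in the outcome for a one-unit change in the predictor, which is the "level of change" referred to above.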

Category 2: Comparison tests

These tests look for differences between group means (comparison of means).

Types of comparison tests
  • T-tests are used when comparing the means of exactly two groups (e.g. the average heights of men and women).
  • Independent t-test: Tests the difference in the same variable between two different populations (e.g. comparing dogs to cats).
  • ANOVA is used to compare the means of more than two groups (e.g. the average weights of children, teenagers, and adults); MANOVA extends this to several outcome variables at once.

Category 3: Correlation tests

These tests look for an association between variables, checking whether two variables are related.

Types of correlation tests
  • Pearson Correlation: Tests for the strength of the association between two continuous variables.
  • Spearman Correlation: Tests for the strength of the association between two ordinal variables. It is based on ranks and does not rely on the assumption of normally distributed data, so strictly speaking it is a non-parametric test.
  • Chi-Square Test: Tests whether there is an association between two categorical variables. Like Spearman, it is usually classed as non-parametric; it is listed here because it answers the same kind of question.
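
The difference between the first two coefficients can be seen in a short sketch. The data and function names are assumptions made for illustration. Spearman's coefficient is computed here as Pearson's coefficient applied to the ranks of the data, with tied values sharing their average rank.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data with a curved but always increasing relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 4.0, 9.0, 16.0, 25.0]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson = {r_pearson:.3f}, Spearman = {r_spearman:.3f}")
```

Because y always increases with x, the ranks agree perfectly and Spearman's coefficient is 1, while Pearson's coefficient is slightly lower because the relationship is not a straight line.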

The table below summarizes the tests described above and will assist you in selecting one of them.
1639844726180.png

5. FLOWCHART: SELECTING A PARAMETRIC TEST

This flowchart will assist you in selecting one of the parametric tests described above
1639844844800.png

6. HANDLING NON-NORMAL DISTRIBUTIONS

Although the normal distribution is the most commonly used in statistics, many processes follow non-normal distributions.
When your data is supposed to fit a normal distribution but does not, we have a few options for dealing with it:
  • If your sample size is large enough (as a rule of thumb, more than 20 items), we may be able to run parametric tests and interpret the results accordingly.
  • We can transform the data using various statistical techniques to bring it closer to a normal distribution.
  • A non-parametric test may be used if the sample is small, the data are skewed, or the data follow another type of distribution.
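
The second option, transforming the data, can be sketched as follows. The log transform is just one common choice for right-skewed, strictly positive data and is an assumption here, since the text does not name a specific technique; the data and the function name `skewness` are likewise invented for illustration.

```python
from math import log
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: about 0 for symmetric data, positive for a long right tail."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

# Hypothetical right-skewed measurements (all positive), invented for illustration.
raw = [1.0, 2.0, 2.0, 3.0, 4.0, 8.0, 20.0, 55.0]
logged = [log(v) for v in raw]

skew_before = skewness(raw)
skew_after = skewness(logged)
print(f"skewness before = {skew_before:.2f}, after log transform = {skew_after:.2f}")
```

A transformation rarely makes data perfectly normal; the aim is to reduce the skew enough for the parametric test to behave reasonably. Results then have to be interpreted on the transformed scale.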

Non-parametric tests
Non-parametric tests (shown below) make fewer assumptions about the data than parametric tests and are useful when one or more of the three statistical assumptions is violated.

1639844956635.png
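
As one illustration of how a rank-based test works, the sketch below computes the Mann-Whitney U statistic, a widely used non-parametric alternative to the independent t-test. Whether it appears in the figure above is an assumption; the data and the function name are invented for illustration, and only the statistic is computed, not its p-value.

```python
def mann_whitney_u(a, b):
    """Return (U_a, U_b): for each sample, the number of cross-sample pairs it 'wins'.

    A pair counts 1 when the value from the sample is larger, 0.5 when tied.
    """
    u_a = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    u_b = len(a) * len(b) - u_a
    return u_a, u_b

# Hypothetical measurements with one extreme value in the first group.
group_1 = [12.0, 15.0, 18.0, 40.0]
group_2 = [9.0, 10.0, 11.0, 14.0]

u_1, u_2 = mann_whitney_u(group_1, group_2)
print(f"U_1 = {u_1}, U_2 = {u_2}")
```

Only the ordering of the values matters: replacing the extreme value 40.0 with 400.0 leaves both statistics unchanged, which is why such tests cope well with skewed data and outliers.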


The first chapter has come to an end
Questions and suggestions are always welcome
We will go over category one, which is based on regression tests, in the following session.


Afridataexpert Consulting Firm
Email: afridataexpert@gmail.com
Mobile:+255 679926463
 

itakatikiamo

JF-Expert Member
Nov 1, 2014
This takes me way back to chi-square and multiple regression analysis. If you want to get the most out of statistics, you should also learn its software, such as RStudio and SPSS.
 

Teknologist

JF-Expert Member
Oct 20, 2018
Ohooo.
They will tell you to write in the national language so that we can understand you....

I remember Minitab from when I was doing an internship in a Quality Engineering Dept. somewhere....
Cp and Cpk......
 

TensorFlow

JF-Expert Member
Feb 11, 2020
Do you want to deliver highly appealing and scientifically competitive research outputs?

Let's Help You with All Your Research Data Problems: From Data Collection to Data Analysis, and Reports

We employ robust and rigorously-tested algorithms to provide automated data and research solutions with high efficiency, high precision, and unprecedented accuracy, from data collection to data analysis, and reports.

See data visualizations demo:


and sample data and research solutions:


Message us on WhatsApp now: Data & Research Solutions
 
