
What is Statistics

The subject of Statistics has been defined in different ways at different times. One such definition: statistics are numerical statements of facts capable of analysis and interpretation, and the science of statistics is the study of the principles and methods applied in collecting, presenting, analysing and interpreting numerical data in any field of inquiry.

What is Statistics
Definition of Statistics
Statistics is the study of how to collect, organize, analyze, and interpret numerical information from data.
Descriptive statistics involves methods of organizing, picturing and summarizing information from data.
Inferential statistics involves methods of using information from a sample to draw conclusions about the population.
Keep in Mind:
* Statistical inferences are no more accurate than the weakest data they are based on.
* Statistical results should be interpreted by someone who understands the methods used as well as the subject matter.

What is Data?
DATA
Data is a collection of facts, such as values or measurements. It can be numbers, words, measurements, observations or even just descriptions of things.
Qualitative data vs Quantitative data
Data can be qualitative or quantitative. Qualitative data is descriptive information (it describes something), whereas quantitative data is numerical information (numbers).
Quantitative data can also be discrete or continuous:
* Discrete data can only take certain values (like whole numbers).
* Continuous data can take any value (within a range).
Put simply: discrete data is counted, continuous data is measured.

Types of Data
Data
* Quantitative (Numeric), e.g., the height of a person in inches.
* Qualitative (Non-Numeric), e.g., the color of a person's eyes.

OBSERVATIONS AND VARIABLES


Observations: In statistics, an observation often means any sort of recording of information, whether it is a physical measurement such as height or weight, a classification such as heads or tails, or an answer to a question such as yes or no.
Variable: A characteristic that varies from one individual or object to another is called a variable. For example, age is a variable because it varies from person to person. A variable can assume a number of values.

Variable
A quantity that varies from individual to individual. A variable may be:
* Quantitative (Numeric)
* Qualitative (Non-Numeric)

QUANTITATIVE & QUALITATIVE VARIABLES


Variables may be classified as quantitative or qualitative according to the form of the characteristic of interest. A variable is called a quantitative variable when the characteristic can be expressed numerically, such as age, weight, income or number of children. A variable is called a qualitative variable when the characteristic is non-numerical, such as education, sex, eye colour, quality, intelligence, poverty or satisfaction. A qualitative characteristic is also called an attribute. Individuals or objects with such a characteristic can be counted or enumerated after having been assigned to one of several mutually exclusive classes or categories (classes with no members in common).

Levels of Measurement
1. Nominal Level (in name only): Qualities with no ranking or ordering and no numerical or quantitative value. Data consist of names, labels and categories, also called categorical data.
E.g., car colors for a certain model are red, silver, blue and black; or 0 = concentrations below reporting limit, 1 = above reporting limit but below a health standard, 2 = above health standard.
2. Ordinal Level: Data can be arranged in some order, but the differences between the data values are meaningless.
E.g., of 17 fishing reels rated, 6 were rated good quality, 4 were rated better quality, and 7 were rated best quality.

Levels of Measurement
3. Interval Level: Data values can be ranked and the differences between data values are meaningful. However, there is no intrinsic zero, or starting point, and ratios of data values are meaningless.
E.g., the years in which Democrats won presidential elections, or temperature (in degrees Celsius or Fahrenheit).
4. Ratio Level: Similar to interval, except that there is an inherent zero, or starting point, and ratios of data values have meaning.
E.g., length of trout in the North River.

Categorical vs Quantitative Data


Categorical data
* Nominal, and sometimes ordinal
* For a single variable, summaries include frequencies and proportions only
Quantitative data
* Interval, ratio, sometimes ordinal
* Summarize with numerical summaries

EXAMPLES OF DATA
Data are collected in many aspects of everyday life.
* Statements given to a police officer, physician or psychologist during an interview are data.
* So are the correct and incorrect answers given by a student on a final examination.
* Almost any athletic event produces data: the time required by a runner to complete a marathon, the number of errors committed by a baseball team in nine innings of play, and so on.

Summary Features of Quantitative data


1. Location (Center, Average)
2. Spread (Variability)
3. Shape (Normal, skewed, etc.)
4. Outliers (Unusual values)
We use different types of charts, pictures and numerical summaries to examine these.

Describing Shape for data

* Symmetric, bell-shaped
* Asymmetric or skewed, not bell-shaped
* Bimodal: two prominent peaks (modes)
* Skewed Right: on a number line, values are clumped at the left end and extend to the right.
* Skewed Left: on a number line, values are clumped at the right end and extend to the left.

Bell-shaped example

Bimodal Example
Old Faithful Geyser, time between eruptions, histogram

Times between eruptions of the Old Faithful geyser, shape is bimodal. Two clusters, one around 50 min., other around 80 min.

Describing the Location of a Data Set


Mean: the numerical average
Median: the middle value (if n is odd) or the average of the middle two values (if n is even)
1. Symmetric: mean = median (approximately, for sample data)
2. Skewed Left: usually mean < median
3. Skewed Right: usually mean > median
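The relationship between mean and median can be checked on a small data set. The sketch below uses Python's standard statistics module; the values in right_skewed are hypothetical and chosen only to illustrate a right-skewed set.

```python
from statistics import mean, median

# Hypothetical right-skewed data: most values are small, one is very large.
right_skewed = [20, 22, 25, 27, 30, 35, 150]

avg = mean(right_skewed)    # numerical average
mid = median(right_skewed)  # middle value, since n = 7 is odd

# With an even number of values, the median is the average of the middle two.
even_mid = median([1, 2, 3, 4])

print(f"mean = {avg:.2f}, median = {mid}")
print(f"median of an even-sized set = {even_mid}")
```

The single large value pulls the mean above the median, which is the usual pattern for data skewed to the right.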

Data values skewed to the right

Bell-shaped distribution

Describing Spread with Standard Deviation

Standard deviation measures variability by summarizing how far individual data values are from the mean. Think of the standard deviation as roughly the average distance values fall from the mean.

Describing Spread with Standard Deviation


A very simple example:

Set    Numbers                    Mean    Standard Deviation
X      100, 100, 100, 100, 100    100     0
Y      90, 90, 100, 110, 110      100     10

Both sets have the same mean of 100. Set X: all values are equal to the mean, so there is no variability at all. Set Y: one value equals the mean and the other four values are 10 points away from the mean, so the average distance from the mean is about 10.

Population Standard Deviation

Data sets usually represent a sample from a larger population. If the data set includes measurements for an entire population, the notations for the mean and standard deviation are different, and the formula for the standard deviation is also slightly different. A population mean is represented by the Greek letter μ (mu), and the population standard deviation is represented by the lower-case Greek letter σ (sigma). The population variance is

$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$$

and the population standard deviation σ is its square root.
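The difference between the sample and population formulas can be seen on sets X and Y from the simple example. This is a minimal sketch using Python's standard statistics module; the helper population_sd is a hypothetical name introduced here to spell out the population formula, dividing by n.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

x = [100, 100, 100, 100, 100]
y = [90, 90, 100, 110, 110]

def population_sd(values):
    """Population standard deviation: sqrt(sum((x_i - mu)^2) / n)."""
    mu = mean(values)
    return sqrt(sum((v - mu) ** 2 for v in values) / len(values))

sample_sd_x = stdev(x)       # sample formula, divides by n - 1
sample_sd_y = stdev(y)       # sample formula, divides by n - 1
pop_sd_y = population_sd(y)  # population formula, divides by n

print(f"sample SD of X = {sample_sd_x}")
print(f"sample SD of Y = {sample_sd_y:.2f}")
print(f"population SD of Y = {pop_sd_y:.2f}")
```

Treating Y as a sample gives a standard deviation of 10; treating the same five values as a whole population gives a slightly smaller value, because the sum of squared deviations is divided by n instead of n - 1.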

Data Representation
* Tabulation
* Simple bar chart
* Component bar chart
* Multiple bar chart
* Pie chart

The tree-diagram below presents an outline of the various techniques


TYPES OF DATA

Qualitative
* Univariate Frequency Table: Percentages, presented as a Pie Chart or Bar Chart
* Bivariate Frequency Table: Component Bar Chart, Multiple Bar Chart

Quantitative
* Discrete: Frequency Distribution, presented as a Bar Chart or Line Chart
* Continuous: Frequency Distribution, presented as a Histogram, Frequency Polygon or Frequency Curve

Example
Suppose that we are carrying out a survey of the first-year students of a co-educational college. Suppose that in all there are 1200 first-year students in this college. We wish to determine:

* What proportion of students have come from Urdu medium schools?
* What proportion have come from English medium schools?

Interview Results
We will have an array of observations as follows: U, U, E, U, E, E, E, U, ..., where U = Urdu medium and E = English medium.

Question: What should we do with this type of data?


Obviously, the first thing that comes to mind is to count the number of students who said Urdu medium as well as the number of students who said English medium.

The results can be shown in the following table:

Medium of Institution    No. of Students (f)
Urdu                     719
English                  481
Total                    1200

Important: The technical term for the numbers given in the second column of this table is frequency. It means how frequently something happens. Out of the 1200 students, 719 stated that they had come from Urdu medium schools and the remaining 481 from English medium schools.

Dividing the cell frequencies by the total frequency and multiplying by 100, we obtain the following:

Medium of Institution    f       %
Urdu                     719     59.9 ≈ 60%
English                  481     40.1 ≈ 40%
Total                    1200    100

Percentage = (frequency / Total No. of Students) × 100
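The counting and percentage steps can be sketched in a few lines of Python. The list answers holds only the first eight responses shown in the interview results, and the function name percentages is a hypothetical helper introduced for illustration.

```python
from collections import Counter

# First eight interview answers (U = Urdu medium, E = English medium).
answers = ["U", "U", "E", "U", "E", "E", "E", "U"]

# Counting how often each answer occurs gives the frequency table.
freq = Counter(answers)

def percentages(frequencies):
    """Percentage = (frequency / total frequency) * 100, rounded to 1 decimal."""
    total = sum(frequencies.values())
    return {k: round(100 * f / total, 1) for k, f in frequencies.items()}

# The full survey frequencies from the table.
survey = {"Urdu": 719, "English": 481}
survey_pct = percentages(survey)

print(dict(freq))
print(survey_pct)
```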

Diagrammatical Representation of Data


A pie chart consists of a circle which is divided into two or more parts in accordance with the number of distinct categories that we have in our data.

Medium of Institution    f       Angle
Urdu                     719     215.7°
English                  481     144.3°
Total                    1200    360°

[Figure: pie chart with two sectors, Urdu 60% and English 40%]

Division of Circle = (Cell Frequency / Total Frequency) × 360°
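The angle of each sector follows directly from the division-of-circle rule. A minimal sketch, assuming the survey frequencies from the table; pie_angles is a hypothetical helper name.

```python
def pie_angles(frequencies):
    """Angle of each sector = (cell frequency / total frequency) * 360 degrees."""
    total = sum(frequencies.values())
    return {k: 360 * f / total for k, f in frequencies.items()}

angles = pie_angles({"Urdu": 719, "English": 481})

for medium, angle in angles.items():
    print(f"{medium}: {angle:.1f} degrees")
```

The sector angles always add up to 360 degrees, which is a quick check on the arithmetic.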

Diagrammatical Representation of Data


SIMPLE BAR CHART
A simple bar chart consists of horizontal or vertical bars of equal width, with lengths proportional to the values they represent. For example, suppose we have information regarding the turnover of a company for 5 years, as given in the table below:

Years    Turnover (Rupees)
1965     35,000
1966     42,000
1967     43,500
1968     48,000
1969     48,500

In order to represent the above information in the form of a bar chart, all we have to do is to take the year along the x-axis and construct a scale for turnover along the y-axis.
[Figure: empty axes with the years 1965 to 1969 along the x-axis and turnover from 0 to 50,000 along the y-axis]

Next, against each year, we will draw vertical bars of equal width and different heights in accordance with the turnover figures that we have in our table.

As a result we obtain a simple and attractive diagram as shown below.

[Figure: simple bar chart of turnover (Rupees, 0 to 50,000) for the years 1965 to 1969]

When our values do not relate to time, they should be arranged in ascending or descending order before charting.

Bi-variate Frequency Table


What we have just considered was the univariate situation. In each of the two examples, we were dealing with one single variable. In the example of the first-year students of a college, our only variable of interest was medium of schooling. In the second example, our single variable of interest was turnover. Now suppose that, along with the enquiry about the medium of institution, we are also recording the sex of the student.

Student No.    Medium    Gender
1              U         F
2              U         M
3              E         M
4              U         F
5              E         M
6              E         F
7              U         M
8              E         M
:              :         :

Now this is a bivariate situation; we have two variables, medium of schooling and sex of the student.

Bivariate Frequency Table


In order to summarize the above information, we will construct a table called a Bivariate Frequency Table, containing a box head (the column headings) and a stub (the row headings), as shown below:

                 Medium (Box Head)
Sex (Stub)    Urdu    English    Total
Male
Female
Total

Next, we will count the number of students falling in each of the following four categories:

* Male student coming from an Urdu medium school.
* Female student coming from an Urdu medium school.
* Male student coming from an English medium school.
* Female student coming from an English medium school.

As a result, suppose we obtain the following figures:

                 Medium
Sex       Urdu    English    Total
Male       202        350      552
Female     517        131      648
Total      719        481     1200

Bivariate Frequency Table pertaining to two qualitative variables.
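Counting students into the four cells can be sketched as follows. The list records holds only the first eight (medium, gender) pairs from the student listing, so the counts are for illustration and are not the survey totals; the variable names are hypothetical.

```python
from collections import Counter

# First eight students as (medium, gender); U/E = Urdu/English medium, M/F = male/female.
records = [
    ("U", "F"), ("U", "M"), ("E", "M"), ("U", "F"),
    ("E", "M"), ("E", "F"), ("U", "M"), ("E", "M"),
]

# Cell frequencies: one count per (medium, gender) combination.
cells = Counter(records)

# Marginal totals: the "Total" column (by gender) and "Total" row (by medium).
gender_totals = Counter(gender for _, gender in records)
medium_totals = Counter(medium for medium, _ in records)

for gender in ("M", "F"):
    row = [cells[(medium, gender)] for medium in ("U", "E")]
    print(gender, row, gender_totals[gender])
print("Total", [medium_totals[m] for m in ("U", "E")], len(records))
```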

COMPONENT BAR CHART
The information in a bivariate frequency table can be presented diagrammatically by constructing a component bar chart, which is also known as a subdivided bar chart.

[Figure: component bar chart with one bar for Male and one for Female students, each subdivided into Urdu and English medium; vertical axis from 0 to 800]

In the above figure, each bar has been divided into two parts. The first bar represents the total number of male students, whereas the second bar represents the total number of female students. As far as the medium of schooling is concerned, the lower part of each bar represents the students coming from English medium schools, whereas the upper part represents the students coming from Urdu medium schools. The advantage of this kind of diagram is that we are able to ascertain the situation of both variables at a glance. We can compare the number of male students in the college with the number of female students, and at the same time we can compare the number of English medium students among the males with the number of English medium students among the females.

MULTIPLE BAR CHART


A multiple bar chart is used in a situation where we have two or more related sets of data. Example: Suppose we have information regarding the imports and exports of Pakistan for the years 1970-71 to 1974-75, as shown in the table below:

Years      Imports (Crores of Rs.)    Exports (Crores of Rs.)
1970-71    370                        200
1971-72    350                        337
1972-73    840                        855
1973-74    1438                       1016
1974-75    2092                       1029

A multiple bar chart is a very useful and effective way of presenting this kind of information. This kind of chart consists of a set of grouped bars, the lengths of which are proportionate to the values of our variables, and each of which is shaded or colored differently in order to aid identification. With reference to the above example, we obtain the multiple bar chart shown below.

[Figure: Multiple Bar Chart representing Imports & Exports of Pakistan (1970-71 to 1974-75); grouped bars for Imports and Exports for each year, vertical axis from 0 to 2500]

Difference between Component Bar Chart and Multiple Bar Chart


Component Bar Chart
* Information is available regarding totals and their components.
* For example: the total no. of male students, i.e., English medium and Urdu medium together.

Multiple Bar Chart
* No information regarding totals.
* For example: imports and exports do not add up to give you the totality of some one thing.

Population and Sample


Population: The entire collection of individuals or measurements about which information is desired.
Sample: A subset of the population selected for study. The primary objective is to create a subset of the population whose center, spread and shape are as close as possible to those of the population.
Methods of sampling: random sampling, stratified sampling, systematic sampling, cluster sampling, multistage sampling, area sampling, quota sampling, etc.

Statistical Inference

[Diagram: a sample drawn from a population]

Statistical inference is the process by which we acquire information about populations from samples. There are two types of estimates for making inferences:
* Point estimate
* Interval estimate

Point Estimate Vs Interval Estimate


Statisticians use sample statistics to estimate population parameters. For example, sample means are used to estimate population means; sample proportions, to estimate population proportions. An estimate of a population parameter may be expressed in two ways:
* Point estimate. A point estimate of a population parameter is a single value of a statistic. For example, the sample mean x̄ is a point estimate of the population mean μ. Similarly, the sample proportion p is a point estimate of the population proportion P.
* Interval estimate. An interval estimate is defined by two numbers, between which a population parameter is said to lie. For example, a < μ < b is an interval estimate of the population mean μ. It indicates that the population mean is greater than a but less than b.
1/2/2014

Parameter Vs Statistic
Parameter: Any statistical characteristic of a population.
* Population mean, population median and population standard deviation are examples of parameters.
* A parameter describes the distribution of a population.
* Parameters are fixed and usually unknown.

Parameter Vs Statistic
Statistic: Any statistical characteristic of a sample.
* Sample mean, sample median and sample standard deviation are some examples of statistics.
* A statistic describes the distribution of a sample.
* The value of a statistic is known and varies from sample to sample.
* Statistics are used for making inferences about parameters.

Parameter Vs Statistic
Statistical Issue: To describe the distribution of a population, either through a census or by making inferences about the population distribution/population parameter using the sample distribution/sample statistic. E.g., the sample mean is an estimate of the population mean.

Type I and Type II errors


No study is perfect; there is always a chance of error.

Decision                  H0 true / HA false        H0 false / HA true
Accept H0 / reject HA     OK (p = 1 - α)            Type II error (p = β)
Reject H0 / accept HA     Type I error (p = α)      OK (p = 1 - β)

1 - β: power of the test
α: level of significance

Type I and Type II errors


α = 0.05

If the null hypothesis is true, there are only 5 chances in 100 that a result termed "significant" could occur by chance alone.

The probability of making a Type I error (α) can be decreased by lowering the level of significance. As a consequence:
* it will be more difficult to find a significant result
* the power of the test will be decreased
* the risk of a Type II error will be increased

Type I and Type II errors


The probability of making a Type II error (β) can be decreased by increasing the level of significance, but this will increase the chance of a Type I error.

Which type of error are you willing to risk?

Type I and Type II errors Example


Suppose there is a test for a particular disease.
* If the disease really exists and is diagnosed early, it can be successfully treated.
* If it is not diagnosed and treated, the person will become severely disabled.
* If a person is erroneously diagnosed as having the disease and treated, no physical damage is done.

Which type of error are you willing to risk?

Type I and Type II errors, Example


Decision          No disease                                               Disease
Not diagnosed     OK                                                       Type II error: irreparable damage would be done
Diagnosed         Type I error: treated but not harmed by the treatment    OK

Decision: to avoid a Type II error, use a high level of significance.

Steps for testing hypothesis


Hypothesis testing steps:
1. Specifying the null (H0) and alternative (H1) hypotheses
2. Selecting the significance level (α), e.g., 0.05 or 0.01
3. Calculating the test statistic, e.g., for a Z-test, t-test, F-test, etc.
4. Calculating the probability value (p-value) or confidence interval
5. Describing the result and statistic in an understandable way
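The five steps can be illustrated with a two-sided one-sample Z-test. This is only a sketch under stated assumptions: the sample values, the hypothesized mean mu0 = 100 and the known population standard deviation sigma = 5 are all hypothetical, and one_sample_z_test is a helper name introduced here.

```python
from math import sqrt
from statistics import NormalDist, mean

def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided Z-test of H0: population mean = mu0, with known sigma."""
    n = len(sample)
    # Step 3: test statistic.
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    # Step 4: two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Step 5: reject H0 when the p-value is below the significance level.
    return z, p_value, p_value < alpha

# Step 1: H0: mu = 100 versus H1: mu != 100 (hypothetical values).
# Step 2: significance level alpha = 0.05.
sample = [102, 98, 105, 110, 99, 104, 107, 101, 103, 106]
z, p_value, reject_h0 = one_sample_z_test(sample, mu0=100, sigma=5, alpha=0.05)

print(f"z = {z:.3f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

With a stricter level such as alpha = 0.01, the same data would not lead to rejection, which shows how lowering the level of significance makes a significant result harder to find.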

Point Estimation
A point estimate draws inference about a population by estimating the value of an unknown parameter using a single value or a point.

[Diagram: the unknown parameter of the population distribution is estimated by a point estimator computed from the sample distribution]

Interval Estimate
An interval estimator draws inferences about a population by estimating the value of an unknown parameter using an interval.

[Diagram: the unknown parameter of the population distribution is estimated by an interval estimator computed from the sample distribution]

P-value vs Confidence Interval


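Two main ways to assess study precision and the role of chance in a study:
* P-value: measures (as a probability) the evidence against the null hypothesis. It is the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed.
* Confidence interval: an interval within which the value of the parameter is expected to lie with a specified level of confidence.

E.g., a 95% CI implies that if one repeated the study 100 times and computed a CI each time, about 95 of the 100 intervals would contain the true measure of association.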
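A confidence interval for a population mean can be sketched as below. This is a large-sample normal approximation, chosen for simplicity; for small samples a t-based interval would be somewhat wider. The sample values and the helper name mean_confidence_interval are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, level=0.95):
    """Approximate CI for the mean: x_bar +/- z * s / sqrt(n)."""
    n = len(sample)
    centre = mean(sample)                      # point estimate of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% level
    margin = z * stdev(sample) / sqrt(n)
    return centre - margin, centre + margin

# Hypothetical sample of ten measurements.
sample = [52, 48, 50, 55, 47, 51, 49, 53, 50, 45]

low, high = mean_confidence_interval(sample, level=0.95)
low99, high99 = mean_confidence_interval(sample, level=0.99)

print(f"95% CI: ({low:.2f}, {high:.2f})")
print(f"99% CI: ({low99:.2f}, {high99:.2f})")
```

The point estimate sits at the centre of the interval, and asking for a higher level of confidence widens the interval.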

Outliers and how to handle them


Outlier: a data point that is not consistent with the bulk of the data.
* Look for them via graphs.
* Can have a big influence on conclusions.
* Can cause complications in some statistical analyses.
* Cannot be discarded without justification.
* May indicate that the underlying population is skewed, rather than being one unique outlier (especially with small samples).

Possible reasons for outliers and what to do about them


1. The outlier is a legitimate data value and represents natural variability for the group and variable(s) measured. Such values may not be discarded. They provide important information about location and spread.
2. A mistake was made while taking the measurement or entering it into the computer. If verified, the value should be discarded or corrected.
3. The individual observation(s) in question belong(s) to a different group than the bulk of individuals measured. Such values may be discarded if a summary is desired and reported for the majority group only.

Bell-shaped distributions

Measurements that have a bell shape are so common in nature that they are said to have a normal distribution. Knowing the mean and standard deviation completely determines where all of the values fall for a normal distribution, assuming an infinite population. In practice we don't have an infinite population (or sample), but if we have a large sample, we can get good approximations of where values fall.
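How the mean and standard deviation determine where values fall can be sketched with NormalDist from Python's standard statistics module. The default mean of 100 and standard deviation of 10 are hypothetical, and share_within is a helper name introduced for illustration.

```python
from statistics import NormalDist

def share_within(k, mu=100.0, sigma=10.0):
    """Proportion of a normal population within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.4f}")
```

The proportions depend only on k, not on the particular mean or standard deviation: roughly 68% of values fall within one standard deviation of the mean, about 95% within two, and about 99.7% within three.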
