Random Sampling

Gathering Data


There are two ways to gather data from a population:

  1. Census
  2. Sample

A census gathers data from every member of a population, whilst a sample gathers data from only a subset of that population.
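To make the difference concrete, here is a minimal Python sketch. The population (10,000 members, 2,910 of whom have some trait), the sample size of 100 and the fixed seed are all made-up assumptions for illustration, not real data.

```python
import random

# Hypothetical population: 10,000 members, 2,910 of whom have some trait.
population = [True] * 2910 + [False] * 7090


def proportion(members):
    """Fraction of the given members that have the trait."""
    return sum(members) / len(members)


# Census: measure every member of the population.
census_proportion = proportion(population)

# Sample: measure only a randomly chosen subset of 100 members.
rng = random.Random(0)  # fixed seed so the run is repeatable
sample = rng.sample(population, 100)
sample_proportion = proportion(sample)

print(f"census: {census_proportion:.3f}")
print(f"sample: {sample_proportion:.3f}")
```

The census value is exact but requires touching all 10,000 members; the sample value only needs 100 members, at the cost of being an estimate.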

Here are the pros and cons of each method from the perspective of the business gathering the data (these are open to debate):

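Sample

Pro                | Con
-------------------|-------------------------------------
Less time required | Does not represent entire population
Less work required | Less accurate / more biased
More feasible      |

Census

Pro                          | Con
-----------------------------|----------------
More accurate                | Costly
Represents entire population | Less feasible
                             | Time-consuming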

Central Limit Theorem


When a sample contains roughly thirty or more datapoints, the distribution of a sample statistic (such as the sample mean or sample proportion) is approximately normal, and the approximation improves as the sample size grows. This result is called 'the central limit theorem'. It is very important to know, as it allows us to treat that distribution as normal and, using our ClassPad/CAS calculator, infer information about the measured population.

This theorem has several assumptions:

  1. There are at least 30 datapoints
  2. The data must be sampled at random
  3. The datapoints must be independent, and must not interfere with each other
  4. The sample size must be less than or equal to 10% of the population when sampling is done without replacement (you do not need to know this)
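One way to see the theorem in action is a quick simulation. The sketch below is only an illustration under assumed settings: the skewed population (an exponential distribution with mean 1), the 2,000 repeated samples and the seed are all hypothetical choices. It draws many random samples of 30 datapoints and looks at how the sample means are spread.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 30    # datapoints per sample
NUM_SAMPLES = 2000  # number of repeated samples (assumed for illustration)


def sample_mean():
    """Mean of one random sample from a skewed population (mean 1, sd 1)."""
    return statistics.fmean(rng.expovariate(1.0) for _ in range(SAMPLE_SIZE))


means = [sample_mean() for _ in range(NUM_SAMPLES)]

centre = statistics.fmean(means)
spread = statistics.stdev(means)

# For a normal distribution, roughly 68% of values lie within one sd of the mean.
within_one_sd = sum(abs(m - centre) <= spread for m in means) / NUM_SAMPLES

print(f"mean of sample means: {centre:.3f}")
print(f"sd of sample means:   {spread:.3f}")
print(f"share within one sd:  {within_one_sd:.1%}")
```

Even though the individual datapoints are heavily skewed, the sample means should cluster around the population mean with a spread close to 1/√30, in a roughly bell-shaped pattern.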

We will look at this in more detail later on, when it becomes relevant. For now, keep it on the back burner.



Bias


In statistics, bias refers to any unrepresentative sampling or sampling/measurement error that results in the systematic overestimation or underestimation of a population parameter.

In short, a bias is some incorrect sampling methodology that results in your statistics being WRONG!


Sources of Bias:

Selection Bias:

A selection bias is the fault of the sampler (the one who selects members for the study), who may select groups of individuals that all differ systematically from the broader population. For example, selecting only participants who wake up at 12am to complete a study about 'good organisation', or 'healthy lifestyles'. This results in a sample that differs substantially from the population.

Voluntary Bias:

This form of bias arises when only members with strong views about a particular subject choose to participate. This results in a more polarised dataset that does not accurately reflect a population with broader opinions that aren't all 'hot takes'.
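The polarising effect can be sketched with a small simulation. Everything here is a hypothetical assumption: the opinion scale from -5 to +5, the population of 10,000, and especially the response model, in which the chance of volunteering grows with the strength of a person's view.

```python
import random
import statistics

rng = random.Random(2)  # fixed seed so the run is repeatable

# Hypothetical population: opinions from -5 (strongly against) to +5 (strongly for).
population = [rng.randint(-5, 5) for _ in range(10_000)]

# Assumed response model: the stronger the view, the more likely a person volunteers.
# Someone with opinion 0 never responds; someone with +/-5 always responds.
volunteers = [o for o in population if rng.random() < abs(o) / 5]


def mean_strength(opinions):
    """Average strength of opinion, ignoring its direction."""
    return statistics.fmean(abs(o) for o in opinions)


print(f"population: {mean_strength(population):.2f}")
print(f"volunteers: {mean_strength(volunteers):.2f}")
```

Under this assumed model, the volunteers hold noticeably stronger views on average than the population they came from, which is exactly the polarisation described above.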

Acceptability Bias:

This one is mad important. You see this concept everywhere. It defines the modern world of social ostracism, as disgusting, horrific and Orwellian as it is. And it truly is. A person may answer a survey or questionnaire, especially when it is not anonymous, with answers that match the popular responses they perceive would make them more acceptable to others. For many people (most young people nowadays), the goal of increasing one's social desirability may be more valuable than expressing their own beliefs. As a result, the sample does not reflect the population.

Non-Response Bias:

The people who decide not to respond to a survey or study may all have similar viewpoints. This results in a severe shortage of datapoints representing those viewpoints. Thus, the sample does not accurately reflect the population.

Agreeability Bias:

This one is optional. Agreeability is a personality trait defined by a person's fear of confrontation, resulting in a person with high agreeability agreeing with most things without giving them any thought, because just saying 'no' means opposing something. This bias has its roots in both the acceptability bias and response bias, and all round is a bias of one's personality traits.



1) A group of students are asked by several teachers who their favourite teacher is. What are some sources of bias?

Bias is introduced through the non-random selection of students. As this process is non-random, students from the same groups (i.e. those who study similar subjects) may be selected. Thus, the sampled group may contain a higher than normal concentration of students from similar classes, resulting in a biased vote that does not reflect the population.

Bias may also be introduced by the self-interest of students under pressure from the teacher running each sample. Selected students may desire the approval of that teacher and vote accordingly. Thus, the vote will not represent the population... or even the sample. Thus, the vote is biased.

The vote may also be biased by a teacher selecting students they know would vote for a particular teacher. It may also be biased by giving students the option not to participate in the survey: the group of students who do not participate may share similar opinions about their favourite teacher. As the sample does not accurately reflect the population, the vote is biased.

As you can see, there are sooooo many opportunities for bias. The truth is, the goal of a sampler is never to eliminate bias entirely, but to minimise it so that the sample reflects the population as accurately as possible.



Sample Proportions


The sample proportion has the formula:

\(\hat{p} = \frac{X}{n} \)

Here \(X\) is the number of sample members with the characteristic of interest and \(n\) is the sample size. A sample proportion simply refers to the fraction of a sample that has a certain characteristic. The population proportion, on the other hand, refers to the fraction of the entire population that has it.

The population proportion (non-changing) has the notation:

\(p\)

The sample proportion has the notation:

\(\hat{p}\)

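Trial No.              | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10
-----------------------|----|----|----|----|----|----|----|----|----|----
PC or PlayStation (PS) | PC | PS | PS | PC | PC | PC | PC | PC | PS | PC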

2) Using the above table:

a. What is the sample proportion for members who use PlayStation?

\(\begin{aligned} \hat{p} &= \frac{3}{10} \\[5pt] &= 0.3 \end{aligned} \)

b. What percentage of members use PC?

\(\begin{aligned} \hat{p} &= \frac{7}{10} \\[5pt] &= 0.7 \\[5pt] &= 70 \% \end{aligned} \)
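The same counts can be checked in a few lines of Python. This is just a sketch: the list `trials` is the ten responses from the table above typed in by hand, and `sample_proportion` is a helper name made up for this example.

```python
# The ten responses from the table, in trial order.
trials = ["PC", "PS", "PS", "PC", "PC", "PC", "PC", "PC", "PS", "PC"]


def sample_proportion(observations, category):
    """p-hat = X / n: count of the category divided by the sample size."""
    return observations.count(category) / len(observations)


p_hat_ps = sample_proportion(trials, "PS")
p_hat_pc = sample_proportion(trials, "PC")

print(p_hat_ps)           # 0.3
print(f"{p_hat_pc:.0%}")  # 70%
```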


The number of successes \(X\) in a sample follows a binomial distribution built from the population proportion \(p\). When the sample size is less than thirty, we work with this binomial distribution directly.

REMEMBER: FOR SMALL SAMPLE SIZES, USE THE BINOMIAL DISTRIBUTION RATHER THAN THE NORMAL APPROXIMATION


Sample size < 30: Binomial Distribution


For instance, 70.9% of Australian citizens were born in Australia, whilst the other 29.1% were born overseas. Whatever the sample size, the population proportion of Aussies born overseas is fixed: \(p = 0.291\).

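Using the Binomial CDF eActivity on the ClassPad with a sample of \(n = 5\) citizens, we can create a table of the distribution:

Number \(x\) of Australians born overseas | 1      | 2      | 3      | 4      | 5
------------------------------------------|--------|--------|--------|--------|-------
Probability \(P(X=x)\)                    | 0.3677 | 0.3018 | 0.1239 | 0.0254 | 0.0021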

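We can form another, more useful table by transforming the number \(x\) into the sample proportion using the proportion formula \(\hat{p} = \frac{x}{5}\):

Proportion \(\hat{p}\) of Australians born overseas | 0.2    | 0.4    | 0.6    | 0.8    | 1
----------------------------------------------------|--------|--------|--------|--------|-------
Probability \(P(\hat{P}=\hat{p})\)                  | 0.3677 | 0.3018 | 0.1239 | 0.0254 | 0.0021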
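Both tables can be reproduced without the ClassPad. The sketch below assumes a sample size of n = 5 (implied by the x/5 in the proportion formula) and uses the standard binomial probability formula; `binomial_pmf` is a helper name made up for this example. It also prints the x = 0 row, which the tables above leave out.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for a binomial variable with n trials and success chance p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 5, 0.291  # sample size (assumed from x/5) and population proportion

table = {x: binomial_pmf(x, n, p) for x in range(n + 1)}

for x, prob in table.items():
    # x / n converts the count into the sample proportion p-hat
    print(f"x = {x}  p_hat = {x / n:.1f}  P = {prob:.4f}")
```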

The distribution of \(X\), which was created using the population proportion, is binomial (\(X\) is a binomial variable). It features the same formulas as other binomial distributions:

\(E(X) = np\)

\(\sigma^2 = np(1-p)\)

\(\sigma = \sqrt{np(1-p)}\)


Using the formula for sample proportions, we can derive the corresponding formulas for the distribution of the sample proportion \(\hat{P}\):

\(\begin{aligned} E(\hat{P}) &= E(\frac{X}{n}) \\[5pt] &= \frac{1}{n} E(X) \\[5pt] &= \frac{1}{n} \times np \\[5pt] &= p \end{aligned} \)

\(\begin{aligned} Var(\hat{P}) &= Var(\frac{X}{n}) \\[5pt] &= \left( \frac{1}{n} \right)^2 \times Var(X) \\[5pt] &= \frac{1}{n^2} \left( np(1-p) \right) \\[5pt] &= \frac{p(1-p)}{n} \end{aligned} \)

\(E(\hat{P}) = p\)

\(Var(\hat{P}) = \frac{p(1-p)}{n} \)

\(Std(\hat{P}) = \sqrt{\frac{p(1-p)}{n}} \)
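As a sanity check, these formulas can be compared against the mean and variance computed directly from the probabilities. This sketch reuses the born-overseas example, assuming n = 5 and p = 0.291; the variable names are chosen just for illustration.

```python
from math import comb, sqrt

n, p = 5, 0.291  # assumed sample size and population proportion

# Exact binomial probabilities P(X = x) for x = 0..n
pmf = [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]
# Matching sample proportions p-hat = x / n
p_hats = [x / n for x in range(n + 1)]

# Mean and variance of p-hat computed straight from the definition
mean = sum(ph * pr for ph, pr in zip(p_hats, pmf))
var = sum((ph - mean) ** 2 * pr for ph, pr in zip(p_hats, pmf))

print(f"E(P-hat)   direct: {mean:.6f}  formula: {p:.6f}")
print(f"Var(P-hat) direct: {var:.6f}  formula: {p * (1 - p) / n:.6f}")
print(f"Std(P-hat) direct: {sqrt(var):.6f}  formula: {sqrt(p * (1 - p) / n):.6f}")
```

The direct and formula columns should agree up to floating-point rounding.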





Sample size ≥ 30: Normal Distribution and Sample Distribution


As the sample size increases, the graph of the binomial distribution, \(y = P(X = x)\), approximates a normal distribution. We can only be confident that this occurs when:

  1. \(n \ge 30\)
  2. \(np \ge 15\)
  3. \(n(1-p) \ge 15\)
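The three conditions are easy to wrap in a small helper. This is a sketch only: the function name `normal_approximation_ok` and its parameters are made up for this example, with the thresholds of 30 and 15 taken from the list above.

```python
def normal_approximation_ok(n, p, min_n=30, min_count=15):
    """Check the three conditions: n >= 30, np >= 15 and n(1 - p) >= 15."""
    return n >= min_n and n * p >= min_count and n * (1 - p) >= min_count


print(normal_approximation_ok(50, 0.54))  # n = 50, np = 27, n(1-p) = 23
print(normal_approximation_ok(5, 0.291))  # sample size too small
print(normal_approximation_ok(40, 0.1))   # np = 4 is too small
```

Note that a sample of 40 can still fail the check when \(p\) is close to 0 or 1, because too few successes (or failures) are expected.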

The graphs plot the probability \(P(X = x)\) against \(x\). They were produced using the Binomial CDF eActivity and cycling through the Lower and Upper limits (values for \(x\)):

As you can see, each graph has a mean of \(np\).



Evidently, as the sample size \(n\) increases, the binomial distribution approximates the normal distribution better and better (central limit theorem). It will never quite reach it, as the binomial distribution will never be continuous. However, the shape is close enough to perform calculations with the normal distribution when the sample size is at least 30.
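The improvement can also be measured numerically. The sketch below is one possible way to do it, not the only one: for a few sample sizes it finds the largest gap between the binomial cumulative probability and the matching normal cumulative probability. The value p = 0.291, the sample sizes and the helper name `max_cdf_gap` are assumptions chosen for illustration.

```python
from math import comb, sqrt
from statistics import NormalDist


def max_cdf_gap(n, p):
    """Largest difference between the binomial CDF and the normal CDF
    with the same mean (np) and standard deviation (sqrt(np(1-p)))."""
    normal = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
    cumulative, worst = 0.0, 0.0
    for x in range(n + 1):
        cumulative += comb(n, x) * p**x * (1 - p) ** (n - x)  # P(X <= x)
        worst = max(worst, abs(cumulative - normal.cdf(x)))
    return worst


p = 0.291
for n in (5, 30, 100):
    print(f"n = {n:3d}  largest gap = {max_cdf_gap(n, p):.4f}")
```

The gap shrinks as \(n\) grows but never reaches zero, which matches the point that a discrete distribution can only approximate a continuous one.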



This is the formula for the sampling distribution of the sample proportion \(\hat{p}\):

\(\hat{p} \sim N \left(p, \left( \sqrt{\frac{p(1-p)}{n}} \right)^2 \right)\)




3) Pencils are produced by a company in large batches, which are then later refined to meet standards, then sorted into packets. A batch is used to produce a sample of 50 pencils. Out of this sample, 27 pencils have chipped tips.

a. What is the proportion of pencils with chipped tips within the sample?

\(\begin{aligned} \hat{p} &= \frac{27}{50} \\[5pt] &= 0.54 \end{aligned} \)

b. Describe the distribution of sample proportions for pencils with chipped tips.

\(\text{As} \ n \gt 30 \text{, the sample proportion} \ \hat{p} \ \text{will be approximately normally distributed.} \)

\(\mu = 0.54 \)

\(\begin{aligned} \sigma &= \sqrt{\cfrac{p(1-p)}{n}} \\[5px] &= \sqrt{\cfrac{0.54(1-0.54)}{50}} \\[5px] &\approx 0.07048 \end{aligned} \)

\(\therefore \hat{p} \sim N(0.54, 0.07048^2) \)

c. What is the probability that the sample proportion would be less than 40%?

\(P(\hat{p} \le 0.4) = 0.0235\)

d. What is the probability that at least 25 pencils in a sample of 50 will have chipped tips, given the true proportion is 0.54?

ClassPad > binomialCDF(25, 50, 50, 0.54)

P = 0.7614
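For anyone without a ClassPad handy, parts b to d can be cross-checked in Python. This is a sketch using only the standard library; it takes \(p = 0.54\) and \(n = 50\) from the question, and the variable names are made up for this example.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 50, 0.54  # sample size and proportion taken from the question

# Part b: standard deviation of the sample proportion
sigma = sqrt(p * (1 - p) / n)
p_hat_dist = NormalDist(mu=p, sigma=sigma)

# Part c: P(p-hat <= 0.4) using the normal approximation
prob_c = p_hat_dist.cdf(0.4)

# Part d: P(X >= 25), summing exact binomial probabilities from 25 to 50
prob_d = sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(25, n + 1))

print(f"sigma          = {sigma:.5f}")
print(f"P(p-hat <= 0.4) = {prob_c:.4f}")
print(f"P(X >= 25)      = {prob_d:.4f}")
```

Part c uses the normal approximation while part d sums the binomial directly, mirroring the two ClassPad approaches used in the worked answers.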

Contact/Owner

This website in its entirety is owned, programmed, developed and made public by Aaron Fonte

If you have any bugs, suggestions or statements to make, I welcome you to contact my public email: fonteaaron@protonmail.com. The site is best viewed on PC.
