1. Overview
Classifying statistical data is the process of organizing information into meaningful categories. This is a crucial first step in any statistical investigation because the type of data dictates the appropriate methods for analysis, visualization, and interpretation. Understanding the different types of data (qualitative, quantitative, discrete, and continuous) allows you to choose the correct statistical tools and draw valid conclusions.
Key Definitions
- Data: A collection of facts, such as numbers, words, measurements, or observations.
- Qualitative (Categorical) Data: Non-numerical data that describes qualities or characteristics (e.g., eye color, car brands).
- Quantitative Data: Numerical data that can be counted or measured.
- Discrete Data: Quantitative data that can only take specific, distinct values (e.g., the number of students in a room, shoe sizes).
- Continuous Data: Quantitative data that can take any value within a given range, often measured to a certain degree of accuracy (e.g., height, time, mass).
- Frequency: The number of times a particular value or category occurs in a data set.
Core Content
Types of Data
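To classify data correctly, ask yourself: Is it a number? If no, it is qualitative; if yes, it is quantitative. Then ask: Can it take any value within a range, limited only by how accurately you measure it? If yes, it is continuous; if it can only take certain separate values, it is discrete. Decimals are a useful clue but not proof: a shoe size of 7.5 is still discrete because sizes jump in fixed steps.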
| Data Type | Example | Reasoning |
|---|---|---|
| Qualitative | Types of pets | These are labels/words, not numbers. |
| Discrete | Number of goals | You can score 1 or 2 goals, but not 1.5 goals. |
| Continuous | Weight of a cat | A cat could weigh 4 kg, 4.2 kg, or 4.215 kg depending on the scale. |
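The two questions above can be written as a small decision rule. This is only a sketch: the function name `classify` and its two yes/no parameters are made up for illustration, and you still have to answer the questions yourself from the context of the data.

```python
def classify(is_number: bool, any_value_in_range: bool = False) -> str:
    """Classify data from the answers to the two questions.

    is_number: is each piece of data a number?
    any_value_in_range: can it take any value within a range?
    """
    if not is_number:
        return "qualitative"
    return "continuous" if any_value_in_range else "discrete"


# The three rows of the table above.
print(classify(is_number=False))                           # types of pets
print(classify(is_number=True, any_value_in_range=False))  # number of goals
print(classify(is_number=True, any_value_in_range=True))   # weight of a cat
```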
Tabulating Discrete Data
When dealing with a list of raw data, we use a Tally Chart to organize it into a Frequency Table.
Worked example 1 — Tabulating Exam Scores
A group of students took a short quiz, and their scores out of 5 were recorded as follows: 3, 4, 3, 2, 5, 3, 1, 4, 3, 2. Create a frequency table to represent this data.
Step 1: List the unique scores in ascending order in the first column.
Step 2: Create a tally column. For each score in the raw data, mark a tally (|) in the corresponding row. To make counting easier, draw every fifth tally as a diagonal stroke across the previous four, so that the marks are bundled in groups of five.
Step 3: Count the tallies for each score and write the total in the frequency column.
| Score | Tally | Frequency |
|---|---|---|
| 1 | \| | 1 |
| 2 | \|\| | 2 |
| 3 | \|\|\|\| | 4 |
| 4 | \|\| | 2 |
| 5 | \| | 1 |
| Total | | 10 |
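If you want to check a frequency table like this one, a short Python sketch can do the counting. It uses `collections.Counter` from the standard library; the variable names are chosen only for this illustration.

```python
from collections import Counter

# Raw quiz scores from worked example 1.
scores = [3, 4, 3, 2, 5, 3, 1, 4, 3, 2]

# Count how many times each score occurs.
frequency = Counter(scores)

# Print the rows in ascending order of score, as in Step 1.
for score in sorted(frequency):
    print(score, frequency[score])

# Check: the frequencies must add up to the number of data values.
total = sum(frequency.values())
print("Total", total)
```

The final check mirrors what you should do by hand: the total of the frequency column must equal the number of values in the raw list.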
Tabulating Grouped (Continuous) Data
Continuous data (like height) is usually grouped into classes or intervals, because exact values rarely repeat and a table listing every value separately would show almost no pattern. We use inequalities to define the classes so that there are no gaps and no overlaps.
Worked Example: Grouping Heights
Organize the following heights ($h$) in cm into a frequency table: 152, 168, 155, 171, 159, 163.
Step-by-step working:
- Define the class intervals using notation like $150 < h \leq 160$.
- This interval means: "Heights greater than 150 but less than or equal to 160."
- Place each value into the correct group.
| Height ($h$ cm) | Tally | Frequency |
|---|---|---|
| $150 < h \leq 160$ | \|\|\| | 3 |
| $160 < h \leq 170$ | \|\| | 2 |
| $170 < h \leq 180$ | \| | 1 |
| Total | | 6 |
Notation Check:
- $h \leq 160$: Includes 160.
- $h < 160$: Does not include 160.
- $h > 150$: Does not include 150.
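The inequality notation translates directly into a comparison. The sketch below places each height into a class of the form $lower < h \leq upper$; the function name `count_in_classes` is made up for this illustration.

```python
heights = [152, 168, 155, 171, 159, 163]

# Each class is written as (lower, upper) and means lower < h <= upper.
classes = [(150, 160), (160, 170), (170, 180)]


def count_in_classes(values, classes):
    """Count how many values fall in each class lower < h <= upper."""
    counts = {c: 0 for c in classes}
    for v in values:
        for lower, upper in classes:
            if lower < v <= upper:
                counts[(lower, upper)] += 1
                break  # a value belongs to at most one class
    return counts


height_freq = count_in_classes(heights, classes)
for (lower, upper), f in height_freq.items():
    print(f"{lower} < h <= {upper}: {f}")
```

Because the lower bound uses `<` and the upper bound uses `<=`, a height of exactly 160 is counted in the first class only, and a height of exactly 150 would not be counted in any of these classes.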
Worked example 2 — Grouping Rainfall Data
The daily rainfall (in mm) for a 30-day month was recorded as follows: 2.3, 5.1, 0.0, 7.8, 3.2, 1.5, 6.4, 4.9, 2.8, 0.5, 5.5, 3.9, 1.1, 7.2, 4.4, 2.1, 6.9, 5.8, 3.5, 0.8, 7.5, 4.1, 1.8, 6.1, 5.3, 3.0, 0.2, 7.0, 4.7, 2.5. Create a grouped frequency table using class intervals of width 2 mm, starting from 0 mm.
Step 1: Determine the class intervals. Since the data ranges from 0.0 to 7.8, and we want intervals of width 2, we can use the following intervals: $0 \le r < 2$, $2 \le r < 4$, $4 \le r < 6$, $6 \le r < 8$. Note the use of $\le$ and $<$ to include the lower bound but exclude the upper bound.
Step 2: Tally each rainfall measurement into the appropriate interval. For example, 2.3 falls into the $2 \le r < 4$ interval.
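Step 3: Count the tallies for each interval and record the frequency. In the table, a struck-through group (~~||||~~) stands for a bundle of five tallies. Check that the frequencies add up to 30, the number of recorded values.
| Rainfall ($r$ mm) | Tally | Frequency |
|---|---|---|
| $0 \le r < 2$ | ~~\|\|\|\|~~ \|\| | 7 |
| $2 \le r < 4$ | ~~\|\|\|\|~~ \|\|\| | 8 |
| $4 \le r < 6$ | ~~\|\|\|\|~~ \|\|\| | 8 |
| $6 \le r < 8$ | ~~\|\|\|\|~~ \|\| | 7 |
| Total | | 30 |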
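With 30 values it is easy to miscount, so a quick check is worthwhile. The sketch below is one possible way to do it: since every class has the same width and the first class starts at 0, floor division (`r // width`) gives the position of the class $lower \le r < lower + width$ that contains each value. The variable names are chosen only for this illustration.

```python
from collections import Counter

rainfall = [2.3, 5.1, 0.0, 7.8, 3.2, 1.5, 6.4, 4.9, 2.8, 0.5,
            5.5, 3.9, 1.1, 7.2, 4.4, 2.1, 6.9, 5.8, 3.5, 0.8,
            7.5, 4.1, 1.8, 6.1, 5.3, 3.0, 0.2, 7.0, 4.7, 2.5]
width = 2  # class width in mm, with the first class starting at 0

# Class index 0 means 0 <= r < 2, index 1 means 2 <= r < 4, and so on.
class_freq = Counter(int(r // width) for r in rainfall)

for index in sorted(class_freq):
    lower = index * width
    print(f"{lower} <= r < {lower + width}: {class_freq[index]}")

# Check: the frequencies must add up to the number of recorded values.
total = sum(class_freq.values())
print("Total", total)
```

The list contains 30 values, so the frequency column must also total 30.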
Extended Content (Extended Only)
While the Core syllabus focuses on basic classification and tabulation, the Extended syllabus expects a deeper understanding of the implications of data types on statistical analysis. Specifically, Extended students should be able to justify the choice of data representation (e.g., using a histogram for continuous data versus a bar chart for discrete data) and understand how different data types affect the calculation and interpretation of measures of central tendency (mean, median, mode) and dispersion (range, interquartile range, standard deviation).
For example, when dealing with grouped continuous data, the mean can only be estimated, as we don't know the exact values within each interval. We use the midpoint of each interval as an approximation. Similarly, the choice of whether to use the interquartile range or standard deviation as a measure of spread depends on the distribution of the data and the presence of outliers. The interquartile range is more robust to outliers, while the standard deviation provides a more complete picture of the data's variability when the data is approximately normally distributed. Understanding these nuances is crucial for conducting more sophisticated statistical analyses.
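To see how the midpoint estimate works, the sketch below reuses the six heights from the grouping example (class frequencies 3, 2 and 1) and compares the estimated mean with the exact mean of the raw values. The function name `estimated_mean` is made up for this illustration.

```python
heights = [152, 168, 155, 171, 159, 163]

# Grouped table as (lower bound, upper bound, frequency).
grouped = [(150, 160, 3), (160, 170, 2), (170, 180, 1)]


def estimated_mean(table):
    """Estimate the mean using the midpoint of each class."""
    total_f = sum(f for _, _, f in table)
    total_fx = sum(((lower + upper) / 2) * f for lower, upper, f in table)
    return total_fx / total_f


estimate = estimated_mean(grouped)     # uses midpoints 155, 165, 175
exact = sum(heights) / len(heights)    # uses the raw values

print(round(estimate, 2), round(exact, 2))
```

The estimate (about 161.67 cm) is close to the exact mean (about 161.33 cm) but not equal to it, because the midpoints only approximate the values inside each class.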
Worked example 3 — Impact of Outliers on Measures of Dispersion
Consider two datasets representing the ages of participants in a study:
Dataset A: 22, 24, 25, 26, 28
Dataset B: 22, 24, 25, 26, 65
Calculate the range and interquartile range (IQR) for both datasets and compare the results.
Step 1: Calculate the range. The range is the difference between the maximum and minimum values.
For Dataset A: Range = 28 - 22 = 6
For Dataset B: Range = 65 - 22 = 43
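Step 2: Calculate the interquartile range (IQR). The IQR is the difference between the upper quartile (Q3) and the lower quartile (Q1). Here Q1 and Q3 are taken as the medians of the lower and upper halves of the data, excluding the overall median (25).
For Dataset A: Q1 = (22 + 24) ÷ 2 = 23, Q3 = (26 + 28) ÷ 2 = 27, IQR = 27 - 23 = 4
For Dataset B: Q1 = (22 + 24) ÷ 2 = 23, Q3 = (26 + 65) ÷ 2 = 45.5, IQR = 45.5 - 23 = 22.5
Step 3: Compare the results. The outlier (65) in Dataset B raises the range from 6 to 43, an increase of 37, because the range depends only on the two extreme values. The IQR rises from 4 to 22.5, an increase of 18.5: less than the range, but still large, because with only five values the outlier is one of the two values used to find Q3. In a larger dataset a single extreme value usually lies outside the middle 50% of the data and the IQR hardly changes, which is why the IQR is described as more robust to outliers than the range.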
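The range and IQR of both datasets can be checked with a short Python sketch. It assumes the standard-library function `statistics.quantiles` with `method="exclusive"`, which for five values gives the same quartiles as taking the medians of the lower and upper halves; other quartile conventions can give different values. The function name `spread` is made up for this illustration.

```python
from statistics import quantiles

dataset_a = [22, 24, 25, 26, 28]
dataset_b = [22, 24, 25, 26, 65]


def spread(data):
    """Return (range, IQR) for a list of numbers."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _, q3 = quantiles(data, n=4, method="exclusive")
    return max(data) - min(data), q3 - q1


print("A:", spread(dataset_a))
print("B:", spread(dataset_b))
```

Under this convention Dataset A has a range of 6 and an IQR of 4, while Dataset B has a range of 43 and an IQR of 22.5.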
Key Equations
There are no specific formulas for classifying data. However, ensure the total frequency ($\sum f$) equals the number of pieces of data provided.
$\sum f = n$
Where:
- $\sum$: Sum of
- $f$: Frequency
- $n$: Total number of data points
This equation is not provided on the formula sheet and must be memorized.
Common Mistakes to Avoid
- ❌ Wrong: Classifying the number of siblings a student has as continuous.
- ✓ Right: The number of siblings is discrete because you can only have whole numbers of siblings (0, 1, 2, etc.). You cannot have 2.5 siblings.
- ❌ Wrong: Creating overlapping groups when tabulating ages, such as $10-20$ and $20-30$. This makes it unclear which group the value "20" belongs to.
- ✓ Right: Use inequalities ($10 \le x < 20$ and $20 \le x < 30$) so that the value "20" only fits into one group.
- ❌ Wrong: Forgetting to account for all data points when creating a frequency table.
- ✓ Right: After completing the frequency table, sum the frequencies and compare the total to the original number of data points. If they don't match, you've missed a value or counted one incorrectly.
- ❌ Wrong: Assuming all numerical data is continuous.
- ✓ Right: Carefully consider whether the numerical data can take on any value within a range (continuous) or only specific, distinct values (discrete).
Exam Tips
- Command Words:
  - "Classify": State if the data is qualitative or quantitative (discrete/continuous). Provide a brief justification for your classification.
  - "Complete the table": Usually requires you to fill in tally marks or frequencies. Double-check your tallies and calculations.
- Calculator Tip: Most questions in 9.1 are non-calculator as they involve counting and tallying. However, keep your calculator ready to sum large frequency columns to check your work.
- Mark Loss Warning: Students often lose marks for incorrect tallying. Cross out the numbers in the raw data list one by one as you enter them into the tally chart to avoid double-counting or skipping values. Consider rewriting the data in ascending order before tallying to minimize errors.
- Real-world context: Expect questions involving school surveys (number of siblings), factory production (mass of items), or nature (rainfall measurements). Pay attention to the units of measurement and the context of the data when classifying.