9.1 Classifying statistical data

1. Overview

Classifying statistical data is the process of organizing information into meaningful categories. This is a crucial first step in any statistical investigation because the type of data dictates the appropriate methods for analysis, visualization, and interpretation. Understanding the different types of data (qualitative, quantitative, discrete, and continuous) allows you to choose the correct statistical tools and draw valid conclusions.


Key Definitions

  • Data: A collection of facts, such as numbers, words, measurements, or observations.
  • Qualitative (Categorical) Data: Non-numerical data that describes qualities or characteristics (e.g., eye color, car brands).
  • Quantitative Data: Numerical data that can be counted or measured.
  • Discrete Data: Quantitative data that can only take specific, distinct values (e.g., the number of students in a room, shoe sizes).
  • Continuous Data: Quantitative data that can take any value within a given range, often measured to a certain degree of accuracy (e.g., height, time, mass).
  • Frequency: The number of times a particular value or category occurs in a data set.

Core Content

Types of Data

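To classify data correctly, ask two questions. First: is it a number? If not, it is qualitative; if so, it is quantitative. Second: is it counted or measured? Counted values are discrete, while measured values, which can in principle take any value within a range (including decimals), are continuous. Being a decimal is not enough on its own: shoe sizes such as 6.5 are still discrete because only certain fixed values are possible.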

Data Type     Example           Reasoning
Qualitative   Types of pets     These are labels/words, not numbers.
Discrete      Number of goals   You can score 1 or 2 goals, but not 1.5 goals.
Continuous    Weight of a cat   A cat could weigh 4 kg, 4.2 kg, or 4.215 kg depending on the scale.
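The two classification questions can be sketched as a short Python function. This is only an illustration: the function name `classify` and the `measured` flag are assumptions made for this sketch, and the flag has to be supplied by the person classifying, because whether a quantity is counted or measured comes from the context, not from the numbers themselves.

```python
from numbers import Number

def classify(values, measured):
    """Return 'qualitative', 'discrete' or 'continuous' for a list of observations.

    `measured` is a hypothetical flag: True if the values come from measuring
    (height, mass, time), False if they come from counting or fixed steps.
    """
    # Any non-numeric entry (a word or label) makes the data qualitative.
    if not all(isinstance(v, Number) and not isinstance(v, bool) for v in values):
        return "qualitative"
    # Numeric data: measured quantities are continuous, counted ones are discrete.
    return "continuous" if measured else "discrete"

print(classify(["cat", "dog", "fish"], measured=False))  # qualitative
print(classify([0, 2, 1, 3], measured=False))            # discrete (goals scored)
print(classify([4.0, 4.2, 4.215], measured=True))        # continuous (cat weights)
print(classify([5.5, 6, 6.5], measured=False))           # discrete (shoe sizes)
```

The shoe-size call shows why "can it be a decimal?" is not a reliable test on its own.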

Tabulating Discrete Data

When dealing with a list of raw data, we use a Tally Chart to organize it into a Frequency Table.

Worked example 1 — Tabulating Exam Scores

A group of students took a short quiz, and their scores out of 5 were recorded as follows: 3, 4, 3, 2, 5, 3, 1, 4, 3, 2. Create a frequency table to represent this data.

Step 1: List the unique scores in ascending order in the first column.

Step 2: Create a tally column. For each score in the raw data, mark a tally (|) in the corresponding row. Every fifth tally is drawn as a diagonal line across the previous four, making a bundle of five (written in these notes as ||||/) that is easier to count.

Step 3: Count the tallies for each score and write the total in the frequency column.

Score   Tally   Frequency
1       |       1
2       ||      2
3       ||||    4
4       ||      2
5       |       1
Total           10
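The same tabulation can be checked with a few lines of Python. This is a minimal sketch using the standard-library `collections.Counter`; the variable names are chosen for illustration only.

```python
from collections import Counter

# Raw quiz scores from worked example 1.
scores = [3, 4, 3, 2, 5, 3, 1, 4, 3, 2]

# Counter does the tallying: it maps each score to how often it occurs.
frequency = Counter(scores)

print("Score  Frequency")
for score in sorted(frequency):          # unique scores in ascending order
    print(f"{score:>5}  {frequency[score]:>9}")
print(f"Total  {sum(frequency.values()):>9}")
```

The final total should equal the number of raw scores (10 here), which is the same check you would make by hand.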

Tabulating Grouped (Continuous) Data

Continuous data (like height) is usually grouped into classes or intervals, because almost every measured value is different and a table listing each value separately would show little. We use inequalities to define the classes so that there are no gaps and no overlaps.

Worked Example: Grouping Heights

Organize the following heights ($h$) in cm into a frequency table: 152, 168, 155, 171, 159, 163.

Step-by-step working:

  1. Define the class intervals using notation like $150 < h \leq 160$.
  2. This interval means: "Heights greater than 150 but less than or equal to 160."
  3. Place each value into the correct group.

Height ($h$ cm)      Tally   Frequency
$150 < h \leq 160$   |||     3
$160 < h \leq 170$   ||      2
$170 < h \leq 180$   |       1
Total                        6

Notation Check:

  • $h \leq 160$: Includes 160.
  • $h < 160$: Does not include 160.
  • $h > 150$: Does not include 150.
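The grouping of the six heights can be sketched in Python as follows. The function name `group_upper_inclusive` is hypothetical; the comparison `low < v <= high` mirrors the notation $150 < h \leq 160$, so a height of exactly 160 lands in the first class and not the second.

```python
heights = [152, 168, 155, 171, 159, 163]
classes = [(150, 160), (160, 170), (170, 180)]

def group_upper_inclusive(values, classes):
    """Count values in classes of the form low < v <= high."""
    counts = {c: 0 for c in classes}
    for v in values:
        for low, high in classes:
            if low < v <= high:          # excludes low, includes high
                counts[(low, high)] += 1
                break                    # each value belongs to exactly one class
    return counts

height_table = group_upper_inclusive(heights, classes)
for (low, high), f in height_table.items():
    print(f"{low} < h <= {high}: {f}")
```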

Worked example 2 — Grouping Rainfall Data

The daily rainfall (in mm) for a month was recorded as follows: 2.3, 5.1, 0.0, 7.8, 3.2, 1.5, 6.4, 4.9, 2.8, 0.5, 5.5, 3.9, 1.1, 7.2, 4.4, 2.1, 6.9, 5.8, 3.5, 0.8, 7.5, 4.1, 1.8, 6.1, 5.3, 3.0, 0.2, 7.0, 4.7, 2.5. Create a grouped frequency table using class intervals of width 2 mm, starting from 0 mm.

Step 1: Determine the class intervals. Since the data ranges from 0.0 to 7.8, and we want intervals of width 2, we can use the following intervals: $0 \le r < 2$, $2 \le r < 4$, $4 \le r < 6$, $6 \le r < 8$. Note the use of $\le$ and $<$ to include the lower bound but exclude the upper bound.

Step 2: Tally each rainfall measurement into the appropriate interval. For example, 2.3 falls into the $2 \le r < 4$ interval.

Step 3: Count the tallies for each interval and record the frequency.

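Rainfall ($r$ mm)   Tally       Frequency
$0 \le r < 2$       ||||/ ||    7
$2 \le r < 4$       ||||/ |||   8
$4 \le r < 6$       ||||/ |||   8
$6 \le r < 8$       ||||/ ||    7
Total                           30

In the tally column, ||||/ stands for a bundle of five (four strokes crossed by a diagonal). The total of 30 matches the 30 recorded values.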
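As a check on the tallying, the grouping can be sketched in Python. Because every class has the same width and the first class starts at 0, the class index of a value is simply its floor division by the width; this shortcut is an assumption of the sketch and would not work for unequal class widths.

```python
rainfall = [2.3, 5.1, 0.0, 7.8, 3.2, 1.5, 6.4, 4.9, 2.8, 0.5,
            5.5, 3.9, 1.1, 7.2, 4.4, 2.1, 6.9, 5.8, 3.5, 0.8,
            7.5, 4.1, 1.8, 6.1, 5.3, 3.0, 0.2, 7.0, 4.7, 2.5]

WIDTH = 2        # class width in mm
N_CLASSES = 4    # 0-2, 2-4, 4-6, 6-8

frequencies = [0] * N_CLASSES
for r in rainfall:
    # Floor division puts r into the class with lower <= r < upper,
    # so a value on a boundary (such as 2.0) would go into the higher class.
    index = int(r // WIDTH)
    frequencies[index] += 1

for i, f in enumerate(frequencies):
    print(f"{i * WIDTH} <= r < {(i + 1) * WIDTH}: {f}")
print("Total:", sum(frequencies))
```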

Extended Content (Extended Only)

While the Core syllabus focuses on basic classification and tabulation, the Extended syllabus expects a deeper understanding of the implications of data types on statistical analysis. Specifically, Extended students should be able to justify the choice of data representation (e.g., using a histogram for continuous data versus a bar chart for discrete data) and understand how different data types affect the calculation and interpretation of measures of central tendency (mean, median, mode) and dispersion (range, interquartile range, standard deviation).

For example, when dealing with grouped continuous data, the mean can only be estimated, as we don't know the exact values within each interval. We use the midpoint of each interval as an approximation. Similarly, the choice of whether to use the interquartile range or standard deviation as a measure of spread depends on the distribution of the data and the presence of outliers. The interquartile range is more robust to outliers, while the standard deviation provides a more complete picture of the data's variability when the data is approximately normally distributed. Understanding these nuances is crucial for conducting more sophisticated statistical analyses.
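The midpoint estimate described above is $\bar{x} \approx \frac{\sum f m}{\sum f}$, where $m$ is the midpoint of each interval. A minimal Python sketch follows, assuming the rainfall classes from worked example 2 and the frequencies obtained by tallying its 30 raw values (7, 8, 8 and 7).

```python
# Classes (lower, upper) in mm and their frequencies from tallying the rainfall data.
classes = [(0, 2), (2, 4), (4, 6), (6, 8)]
frequencies = [7, 8, 8, 7]

# The midpoint stands in for every value in its class.
midpoints = [(low + high) / 2 for low, high in classes]

# Estimated mean = sum of (frequency x midpoint) / total frequency.
estimated_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / sum(frequencies)
print(estimated_mean)  # 4.0
```

The exact mean of the 30 raw values is 117.9 / 30 = 3.93 mm, so the estimate of 4.0 mm is close but not identical, which is why it must be described as an estimate.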

Worked example 3 — Impact of Outliers on Measures of Dispersion

Consider two datasets representing the ages of participants in a study:

Dataset A: 22, 24, 25, 26, 28

Dataset B: 22, 24, 25, 26, 65

Calculate the range and interquartile range (IQR) for both datasets and compare the results.

Step 1: Calculate the range. The range is the difference between the maximum and minimum values.

For Dataset A: Range = 28 - 22 = 6

For Dataset B: Range = 65 - 22 = 43

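Step 2: Calculate the interquartile range (IQR). The IQR is the difference between the upper quartile (Q3) and the lower quartile (Q1). With five ordered values, Q1 lies midway between the 1st and 2nd values and Q3 lies midway between the 4th and 5th values.

For Dataset A: Q1 = (22 + 24) / 2 = 23, Q3 = (26 + 28) / 2 = 27, IQR = 27 - 23 = 4

For Dataset B: Q1 = (22 + 24) / 2 = 23, Q3 = (26 + 65) / 2 = 45.5, IQR = 45.5 - 23 = 22.5

Step 3: Compare the results. The range depends directly on the extreme value, so the outlier (65) in Dataset B increases it from 6 to 43. The IQR also rises, from 4 to 22.5, but by half as much (18.5 compared with 37), and only because the data set is so small that the outlier is one of the two values used to find Q3. In a larger data set a single outlier usually lies outside the values used for the quartiles and leaves the IQR almost unchanged, which is why the IQR is considered more robust to outliers than the range.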
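The calculation can be reproduced with Python's standard library. This sketch assumes the quartile convention in which Q1 and Q3 sit at positions (n + 1)/4 and 3(n + 1)/4, which is what `statistics.quantiles` does with `method="exclusive"` and which gives Q1 = 23 and Q3 = 27 for Dataset A. Other conventions exist and give different quartiles for small data sets. The helper name `spread` is hypothetical.

```python
from statistics import quantiles

def spread(data):
    """Return the range, quartiles and IQR of a list of numbers."""
    # n=4 splits the data into quarters and returns [Q1, median, Q3].
    q1, _median, q3 = quantiles(data, n=4, method="exclusive")
    return {"range": max(data) - min(data), "q1": q1, "q3": q3, "iqr": q3 - q1}

dataset_a = [22, 24, 25, 26, 28]
dataset_b = [22, 24, 25, 26, 65]

print(spread(dataset_a))
print(spread(dataset_b))
```

With only five values, the outlier in Dataset B is one of the two values that determine Q3, so it affects the IQR as well as the range; in a larger data set a single extreme value would normally leave the quartiles almost untouched.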


Key Equations

There are no specific formulas for classifying data. However, ensure the total frequency ($\sum f$) equals the number of pieces of data provided.

$\sum f = n$

Where:

  • $\sum$: Sum of
  • $f$: Frequency
  • $n$: Total number of data points

This equation is not provided on the formula sheet and must be memorized.
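As a quick illustration of the check $\sum f = n$, here is a short Python sketch. The function name `totals_match` is made up for this example.

```python
def totals_match(frequencies, n):
    """Return True if the frequencies add up to the number of data points n."""
    return sum(frequencies) == n

# Quiz scores from worked example 1: 10 data points.
print(totals_match([1, 2, 4, 2, 1], 10))  # True
# One tally missed for the score 3: the totals no longer agree.
print(totals_match([1, 2, 3, 2, 1], 10))  # False
```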


Common Mistakes to Avoid

  • Wrong: Classifying the number of siblings a student has as continuous.
  • Right: The number of siblings is discrete because you can only have whole numbers of siblings (0, 1, 2, etc.). You cannot have 2.5 siblings.
  • Wrong: Creating overlapping groups when tabulating ages, such as 10–20 and 20–30. This makes it unclear which group the value "20" belongs to.
  • Right: Use inequalities ($10 \le x < 20$ and $20 \le x < 30$) so that the value "20" only fits into one group.
  • Wrong: Forgetting to account for all data points when creating a frequency table.
  • Right: After completing the frequency table, sum the frequencies and compare the total to the original number of data points. If they don't match, you've missed a value or counted one incorrectly.
  • Wrong: Assuming all numerical data is continuous.
  • Right: Carefully consider whether the numerical data can take on any value within a range (continuous) or only specific, distinct values (discrete).
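The overlapping-groups mistake can be made concrete with a small Python sketch. The function `classes_containing` and its `closed_upper` flag are hypothetical names used only here: with both ends included, the age 20 falls into two groups, while with $10 \le x < 20$ style classes it falls into exactly one.

```python
def classes_containing(x, classes, closed_upper):
    """List every class (low, high) that contains x.

    closed_upper=True  models groups written as 10-20, 20-30 (both ends included).
    closed_upper=False models low <= x < high.
    """
    return [
        (low, high)
        for low, high in classes
        if low <= x and (x <= high if closed_upper else x < high)
    ]

age_classes = [(10, 20), (20, 30)]
print(classes_containing(20, age_classes, closed_upper=True))   # [(10, 20), (20, 30)]
print(classes_containing(20, age_classes, closed_upper=False))  # [(20, 30)]
```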

Exam Tips

  • Command Words:
    • "Classify": State if the data is qualitative or quantitative (discrete/continuous). Provide a brief justification for your classification.
    • "Complete the table": Usually requires you to fill in tally marks or frequencies. Double-check your tallies and calculations.
  • Calculator Tip: Most questions in 9.1 are non-calculator as they involve counting and tallying. However, keep your calculator ready to sum large frequency columns to check your work.
  • Mark Loss Warning: Students often lose marks for incorrect tallying. Cross out the numbers in the raw data list one by one as you enter them into the tally chart to avoid double-counting or skipping values. Consider rewriting the data in ascending order before tallying to minimize errors.
  • Real-world context: Expect questions involving school surveys (number of siblings), factory production (mass of items), or nature (rainfall measurements). Pay attention to the units of measurement and the context of the data when classifying.

