The cut() Function
The cut() function in R is used to divide a numeric vector into intervals or bins and to assign labels to these intervals. This is useful for converting continuous variables into categorical ones.
Basic Syntax
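cut(x, breaks, labels = NULL, include.lowest = FALSE, right = TRUE, ...)
- x: Numeric vector to be divided into intervals.
- breaks: Either a single number (2 or greater) giving the number of intervals, or a numeric vector of two or more unique cut points.
- labels: Labels for the resulting levels. The default, NULL, builds labels in interval notation such as "(20,30]"; a character vector supplies custom names; FALSE returns plain integer codes instead of a factor.
- include.lowest: Logical; if TRUE, a value equal to the lowest break (or to the highest break when right = FALSE) is included in the first (or last) interval.
- right: Logical; if TRUE, intervals are closed on the right and open on the left, as in (a, b]; if FALSE, they are closed on the left and open on the right, as in [a, b).
- ...: Further arguments passed to or from other methods.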
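As a quick illustration of the labels argument, the sketch below compares the default (NULL) with labels = FALSE. The vector x and its values are made up for this example.

```r
# Illustrative input values (not from the examples in this article)
x <- c(1, 5, 10)

# Default labels = NULL: a factor whose levels use interval notation
f <- cut(x, breaks = c(0, 5, 10))
print(levels(f))   # "(0,5]"  "(5,10]"

# labels = FALSE: plain integer codes instead of a factor
codes <- cut(x, breaks = c(0, 5, 10), labels = FALSE)
print(codes)       # 1 1 2
```

Integer codes can be handy when the bin index itself is needed, for example to index into another vector.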
Detailed Examples
Dividing a Numeric Vector into Equal Intervals
Example 1: Dividing ages into 3 groups
    # Create a numeric vector of ages
    ages <- c(22, 25, 29, 35, 42, 50, 60)

    # Divide ages into 3 intervals
    age_groups <- cut(ages, breaks = 3, labels = c("Young", "Adult", "Senior"))
    print(age_groups)

    # Output:
    # [1] Young  Young  Young  Adult  Adult  Senior Senior
    # Levels: Young Adult Senior
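In this example, cut() splits the range of ages (22 to 60) into 3 equal-width intervals, each about 12.7 years wide, and applies the labels in order. The intervals have equal width, not equal counts, so the groups can hold different numbers of values.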
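To see which intervals cut() actually builds when breaks is a single number, one option is to omit the labels and inspect the generated levels. This is a minimal sketch; the names bins and bin_width are introduced here only for illustration.

```r
ages <- c(22, 25, 29, 35, 42, 50, 60)

# Without custom labels, the levels show the computed intervals
bins <- cut(ages, breaks = 3)
print(levels(bins))   # "(22,34.7]"   "(34.7,47.3]" "(47.3,60]"

# Count how many ages fall into each interval
print(table(bins))

# Width of each interval: the data range divided by the number of bins
bin_width <- diff(range(ages)) / 3
print(bin_width)
```

The outer breaks are nudged slightly beyond the minimum and maximum so that both extreme values land inside an interval.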
Specifying Cut Points Manually
Example 2: Dividing ages into custom intervals
    # Divide ages into custom intervals
    age_groups <- cut(ages, breaks = c(20, 30, 40, 50, 60, 70),
                      labels = c("20-30", "30-40", "40-50", "50-60", "60-70"))
    print(age_groups)

    # Output:
    # [1] 20-30 20-30 20-30 30-40 40-50 40-50 50-60
    # Levels: 20-30 30-40 40-50 50-60 60-70

Here, cut() assigns each age to the interval defined by the given cut points and applies the custom labels. Because intervals are closed on the right by default, 50 falls in "40-50" and 60 in "50-60"; no age exceeds 60, so the "60-70" level is empty.
Including the Lower Boundary of Intervals
Example 3: Including the lowest boundary
    # Divide ages into intervals including the lowest boundary
    age_groups <- cut(ages, breaks = c(20, 30, 40, 50, 60), include.lowest = TRUE)
    print(age_groups)

    # Output:
    # [1] [20,30] [20,30] [20,30] (30,40] (40,50] (40,50] (50,60]
    # Levels: [20,30] (30,40] (40,50] (50,60]

In this case, include.lowest = TRUE closes the first interval on the left, so a value equal to the lowest break (20) would be assigned to [20,30] instead of becoming NA. No age here equals 20, so only the label of the first level changes.
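The effect of include.lowest is easiest to see when a value sits exactly on the lowest break. The following sketch uses a small made-up vector x for that purpose.

```r
# Illustrative values; 20 sits exactly on the lowest break
x <- c(20, 25, 30)
breaks <- c(20, 30, 40)

# Default: the first interval is (20,30], so 20 is not covered
without_lowest <- cut(x, breaks = breaks)
print(without_lowest)   # <NA>    (20,30] (20,30]

# include.lowest = TRUE: the first interval becomes [20,30]
with_lowest <- cut(x, breaks = breaks, include.lowest = TRUE)
print(with_lowest)      # [20,30] [20,30] [20,30]
```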
Excluding Upper Boundaries of Intervals
Example 4: Excluding the upper boundary
    # Divide ages into intervals excluding the upper boundary
    age_groups <- cut(ages, breaks = c(20, 30, 40, 50, 60), right = FALSE)
    print(age_groups)

    # Output:
    # [1] [20,30) [20,30) [20,30) [30,40) [40,50) [50,60) <NA>
    # Levels: [20,30) [30,40) [40,50) [50,60)

Setting right = FALSE makes each interval include its lower boundary and exclude its upper boundary. As a result, 50 moves to [50,60), and 60, which equals the highest break, falls outside every interval and becomes NA.
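With right = FALSE, a value equal to the highest break is not covered by any interval. One way to keep it, sketched below, is to combine right = FALSE with include.lowest = TRUE, which closes the last interval on the right. The variable names are chosen for this example only.

```r
ages <- c(22, 25, 29, 35, 42, 50, 60)
breaks <- c(20, 30, 40, 50, 60)

# Left-closed intervals only: 60 equals the highest break and becomes NA
left_closed <- cut(ages, breaks = breaks, right = FALSE)
print(is.na(left_closed))   # TRUE only for the last element (60)

# Adding include.lowest = TRUE closes the last interval: [50,60]
closed_top <- cut(ages, breaks = breaks, right = FALSE, include.lowest = TRUE)
print(levels(closed_top))   # "[20,30)" "[30,40)" "[40,50)" "[50,60]"
```

Despite its name, include.lowest applies to the highest break when right = FALSE.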
Creating Intervals with Custom Labels
Example 5: Labeling intervals with specific names
    # Create custom labels for intervals
    age_groups <- cut(ages, breaks = c(20, 30, 40, 50, 60),
                      labels = c("Young", "Young Adult", "Adult", "Senior"))
    print(age_groups)

    # Output:
    # [1] Young       Young       Young       Young Adult Adult       Adult
    # [7] Senior
    # Levels: Young Young Adult Adult Senior

Here, the intervals are labeled with custom names. The labels vector must contain exactly one label per interval, that is, one fewer than the number of breaks.
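When the labels simply mirror the cut points, they can be derived from the breaks vector instead of being typed by hand, which keeps the two in sync. This is one possible approach; range_labels is a name introduced here for illustration.

```r
ages <- c(22, 25, 29, 35, 42, 50, 60)
breaks <- c(20, 30, 40, 50, 60)

# Pair each break with the next one: one label per interval
range_labels <- paste(head(breaks, -1), tail(breaks, -1), sep = "-")
print(range_labels)   # "20-30" "30-40" "40-50" "50-60"

age_groups <- cut(ages, breaks = breaks, labels = range_labels)

# Count the ages in each labeled interval
print(table(age_groups))
```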
Key Points to Remember
- Defining Intervals: Use breaks to specify either the number of intervals or the exact cut points.
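- Labels: The labels argument names the intervals. If it is omitted, the levels are written in interval notation such as "(20,30]"; labels = FALSE returns integer codes instead.
- Including Boundaries: right chooses which end of each interval is closed, and include.lowest closes the outermost interval at the extreme break. Values that fall outside all intervals become NA.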
- Usage: cut() is useful for converting continuous variables into categorical factors, which can simplify data analysis and visualization.
Summary
The cut() function in R is a powerful tool for transforming continuous numeric data into categorical factors by dividing the data into specified intervals. You can define the intervals, include or exclude boundaries, and customize interval labels to better understand and analyze your data.