The cut() Function in R

The cut() Function

The cut() function in R divides a numeric vector into intervals (bins) and returns a factor whose levels label those intervals. This is useful for converting continuous variables into categorical ones.

Basic Syntax 

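cut(x, breaks, labels = NULL, include.lowest = FALSE, right = TRUE, ...)
  • x: Numeric vector to be divided.
  • breaks: Either a single number (at least 2) giving the number of intervals, or a vector of two or more cut points that define the intervals.
  • labels: Labels for the resulting levels. Defaults to NULL, in which case labels are built in interval notation such as "(20,30]". A character vector supplies custom labels; FALSE returns plain integer codes instead of a factor.
  • include.lowest: Logical; if TRUE, a value equal to the lowest break (or the highest break when right = FALSE) is included in the first (or last) interval.
  • right: Logical; if TRUE, intervals are closed on the right and open on the left, as in (a, b]; if FALSE, they are closed on the left and open on the right, as in [a, b).
  • ...: Additional arguments passed to or from other methods.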
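
As a quick illustration of the labels argument, the sketch below compares leaving labels unspecified with passing labels = FALSE. The vector scores and its cut points are made up for this illustration and do not appear elsewhere in the article.

```r
# Hypothetical data, used only for this illustration
scores <- c(5, 15, 25)

# labels left unspecified: a factor whose levels use interval notation
as_factor <- cut(scores, breaks = c(0, 10, 20, 30))
print(levels(as_factor))
# "(0,10]"  "(10,20]" "(20,30]"

# labels = FALSE: plain integer codes instead of a factor
as_codes <- cut(scores, breaks = c(0, 10, 20, 30), labels = FALSE)
print(as_codes)
# 1 2 3
```

Integer codes can be handy when the bin index feeds into further computation, while the factor form is usually easier to read in tables and plots.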

Detailed Examples 

Dividing a Numeric Vector into Equal Intervals

Example 1: Dividing ages into 3 groups

# Create a numeric vector of ages
ages <- c(22, 25, 29, 35, 42, 50, 60)
# Divide ages into 3 intervals
age_groups <- cut(ages, breaks = 3, labels = c("Young", "Adult", "Senior"))
print(age_groups)
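# Output:
# [1] Young  Young  Young  Adult  Adult  Senior Senior
# Levels: Young Adult Senior

In this example, cut() divides the range of the ages vector into 3 equal-width intervals and labels them accordingly. The intervals have equal width, not equal counts: here they hold 3, 2, and 2 values.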
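
To see where the three intervals come from, the following sketch rebuilds the cut points by hand. It relies on the documented behaviour that, for a single number of breaks, cut() splits the range of the data into equal-width pieces and then moves the two outer limits outward by 0.1% of the range. The names manual_breaks, pad, by_hand, and automatic are introduced here for illustration.

```r
ages <- c(22, 25, 29, 35, 42, 50, 60)

# Equal-width cut points across the observed range (22 to 60)
manual_breaks <- seq(min(ages), max(ages), length.out = 4)

# Nudge the outer limits outward so the minimum and maximum fall inside
pad <- diff(range(ages)) / 1000
manual_breaks[1] <- manual_breaks[1] - pad
manual_breaks[4] <- manual_breaks[4] + pad

by_hand   <- cut(ages, breaks = manual_breaks,
                 labels = c("Young", "Adult", "Senior"))
automatic <- cut(ages, breaks = 3,
                 labels = c("Young", "Adult", "Senior"))

# Compare the group assignments from both approaches
print(all(as.character(by_hand) == as.character(automatic)))

# Count the values in each group
print(table(automatic))
```

Because the cut points depend on the minimum and maximum of the data, the same age can land in a different group when the vector changes. Explicit cut points avoid that.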

Specifying Cut Points Manually

Example 2: Dividing ages into custom intervals 

# Divide ages into custom intervals
age_groups <- cut(ages, breaks = c(20, 30, 40, 50, 60, 70),
                   labels = c("20-30", "30-40", "40-50", "50-60", "60-70"))
print(age_groups)
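# Output:
# [1] 20-30 20-30 20-30 30-40 40-50 40-50 50-60
# Levels: 20-30 30-40 40-50 50-60 60-70

Here, cut() divides the ages into intervals defined by specific cut points and assigns custom labels. Because intervals are closed on the right by default, 50 falls in "40-50" and 60 falls in "50-60"; the "60-70" level exists but is empty.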
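
Values that sit exactly on a cut point are a common source of surprises with manual breaks. A small sketch, using made-up boundary values rather than the ages vector, shows that under the default right = TRUE each boundary value belongs to the interval that ends at it:

```r
# Hypothetical values that coincide with the cut points
on_boundary <- c(30, 40, 50)

groups <- cut(on_boundary, breaks = c(20, 30, 40, 50, 60))
print(as.character(groups))
# "(20,30]" "(30,40]" "(40,50]"
```

Leaving the default interval labels in place while checking boundaries, as done here, makes it obvious which end of each interval is closed.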

Including the Lower Boundary of Intervals

Example 3: Including the lowest boundary 

# Divide ages into intervals including the lowest boundary
age_groups <- cut(ages, breaks = c(20, 30, 40, 50, 60), include.lowest = TRUE)
print(age_groups)
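# Output:
# [1] [20,30] [20,30] [20,30] (30,40] (40,50] (40,50] (50,60]
# Levels: [20,30] (30,40] (40,50] (50,60]

In this case, include.lowest = TRUE closes the first interval on the left, so a value equal to the lowest break (20) would be included in it. Only the first interval changes; the others remain open on the left.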
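
The ages vector contains no value equal to 20, so the effect of include.lowest is not visible in its group assignments. The sketch below uses a hypothetical vector, ages_with_20, whose minimum coincides with the lowest break, to show the difference:

```r
# Hypothetical vector whose minimum equals the lowest break
ages_with_20 <- c(20, 22, 35)
cut_points <- c(20, 30, 40)

# Default: the first interval is (20,30], so 20 itself is left out
without_lowest <- cut(ages_with_20, breaks = cut_points)
print(is.na(without_lowest))
# TRUE FALSE FALSE

# include.lowest = TRUE closes the first interval: [20,30]
with_lowest <- cut(ages_with_20, breaks = cut_points, include.lowest = TRUE)
print(as.character(with_lowest))
# "[20,30]" "[20,30]" "(30,40]"
```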

Excluding Upper Boundaries of Intervals

Example 4: Excluding the upper boundary 

# Divide ages into intervals excluding the upper boundary
age_groups <- cut(ages, breaks = c(20, 30, 40, 50, 60), right = FALSE)
print(age_groups)
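# Output:
# [1] [20,30) [20,30) [20,30) [30,40) [40,50) [50,60) <NA>
# Levels: [20,30) [30,40) [40,50) [50,60)

By setting right = FALSE, the intervals include the lower boundary and exclude the upper boundary. As a result, 50 moves into [50,60), and 60, which equals the highest break, falls outside every interval and becomes NA.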
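
A value equal to the highest break, such as 60 in this vector, is not covered by left-closed intervals. One way to keep it, sketched below, is to combine right = FALSE with include.lowest = TRUE, which in this mode closes the last interval on the right:

```r
ages <- c(22, 25, 29, 35, 42, 50, 60)

# Left-closed intervals; include.lowest = TRUE closes the LAST interval here,
# so the value 60 (equal to the highest break) is kept
age_groups <- cut(ages, breaks = c(20, 30, 40, 50, 60),
                  right = FALSE, include.lowest = TRUE)
print(levels(age_groups))
# "[20,30)" "[30,40)" "[40,50)" "[50,60]"
print(anyNA(age_groups))
# FALSE
```

An alternative is to extend the breaks past the maximum, for example by adding 70, if a separate top interval makes more sense for the data.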

Creating Intervals with Custom Labels

Example 5: Labeling intervals with specific names 

# Create custom labels for intervals
age_groups <- cut(ages, breaks = c(20, 30, 40, 50, 60),
                   labels = c("Young", "Young Adult", "Adult", "Senior"))
print(age_groups)
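# Output (values in order):
# Young, Young, Young, Young Adult, Adult, Adult, Senior
# Levels: Young, Young Adult, Adult, Senior

Here, the intervals (20,30], (30,40], (40,50], and (50,60] are labeled with custom names. The number of labels must equal the number of intervals, which is one less than the number of break points.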

Key Points to Remember

  • Defining Intervals: Use breaks to specify either the number of intervals or the exact cut points.
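  • Labels: The labels argument allows you to name the intervals. If not provided, levels are written in interval notation such as "(20,30]"; labels = FALSE returns integer codes instead.
  • Including Boundaries: right decides which end of each interval is closed, and include.lowest closes the one remaining open end at the extreme break. Values outside all intervals become NA.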
  • Usage: cut() is useful for converting continuous variables into categorical factors, which can simplify data analysis and visualization.
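
As a sketch of the usage point above, the binned values can be stored as a new column of a data frame and then summarised. The data frame people, its columns, and the label text are hypothetical and chosen only for this illustration.

```r
# Hypothetical data frame for illustration
people <- data.frame(
  name = c("Ana", "Ben", "Cleo", "Dev"),
  age  = c(22, 35, 42, 60)
)

# Store the binned ages as a new factor column;
# the intervals are (20,40] and (40,60]
people$age_group <- cut(people$age,
                        breaks = c(20, 40, 60),
                        labels = c("21-40", "41-60"))

# Count how many people fall in each group
print(table(people$age_group))
```

Because the new column is a factor, functions such as table() report every level, including levels that happen to be empty.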

Summary

The cut() function in R is a powerful tool for transforming continuous numeric data into categorical factors by dividing the data into specified intervals. You can define the intervals, include or exclude boundaries, and customize interval labels to better understand and analyze your data.
