The aggregate() Function
The aggregate() function in R is used to compute summary statistics of data grouped by one or more factors. It is particularly useful when you want to calculate statistics like the mean, sum, or median of a variable, split by levels of one or more grouping variables.
Basic Syntax
aggregate(x, by, FUN, ...)
- x: The data to be aggregated (typically a numeric vector or data frame).
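- by: A list of grouping elements (usually factors), each as long as the data in x. Naming the list elements gives the grouping columns readable names in the result.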
- FUN: The function to apply to each group (e.g., mean, sum, median).
- ...: Additional arguments passed on to FUN.
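The examples later in this section use the formula interface, so a small sketch of the x/by form shown above may help. The data frame scores and its columns are hypothetical names chosen for illustration. The call assumes the data frame method of aggregate(), where by is a named list.

```r
# Hypothetical data, used only for illustration
scores <- data.frame(
  team  = c("A", "A", "B", "B"),
  score = c(10, 20, 30, 50)
)

# x holds the column(s) to summarise; by is a named list of grouping vectors
by_team <- aggregate(
  x   = scores["score"],
  by  = list(team = scores$team),
  FUN = mean
)
print(by_team)

# Output:
#   team score
# 1    A    15
# 2    B    40
```

The name given inside list(team = ...) becomes the name of the grouping column in the result.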
Detailed Examples
Aggregating a Single Numeric Vector
Example 1: Calculate the mean of a numeric vector by a factor
    # Create a data frame
    data <- data.frame(
      group = factor(c("A", "A", "B", "B", "C", "C")),
      value = c(10, 20, 30, 40, 50, 60)
    )

    # Aggregate to find the mean of 'value' for each 'group'
    result <- aggregate(value ~ group, data = data, FUN = mean)
    print(result)

    # Output:
    #   group value
    # 1     A    15
    # 2     B    35
    # 3     C    55
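In this example, aggregate() computes the mean of the value column for each level of the group factor. The call uses the formula interface, aggregate(formula, data, FUN), in which value ~ group reads as "summarize value by group": the variable to aggregate goes on the left of ~ and the grouping variables go on the right.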
Aggregating with Multiple Factors
Example 2: Calculate the sum of a numeric variable grouped by two factors
    # Create a more complex data frame
    data <- data.frame(
      group1 = factor(c("A", "A", "B", "B", "A", "B")),
      group2 = factor(c("X", "Y", "X", "Y", "X", "Y")),
      value = c(10, 20, 30, 40, 50, 60)
    )

    # Aggregate to find the sum of 'value' by 'group1' and 'group2'
    result <- aggregate(value ~ group1 + group2, data = data, FUN = sum)
    print(result)

    # Output:
    #   group1 group2 value
    # 1      A      X    60
    # 2      B      X    30
    # 3      A      Y    20
    # 4      B      Y   100
Here, aggregate() calculates the sum of value for each combination of group1 and group2 that occurs in the data. In the result, the first grouping variable (group1) varies fastest.
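If a different row order is preferred, the result can be re-sorted afterwards. The following is a minimal sketch using base R's order(); the names sales, totals, and totals_sorted are hypothetical, and the data repeats the values from Example 2.

```r
# Same values as Example 2, under a hypothetical name
sales <- data.frame(
  group1 = factor(c("A", "A", "B", "B", "A", "B")),
  group2 = factor(c("X", "Y", "X", "Y", "X", "Y")),
  value  = c(10, 20, 30, 40, 50, 60)
)

# Default result: group1 varies fastest within each level of group2
totals <- aggregate(value ~ group1 + group2, data = sales, FUN = sum)

# Re-sort so rows are ordered by group1 first, then by group2
totals_sorted <- totals[order(totals$group1, totals$group2), ]
print(totals_sorted)
```

The row names in totals_sorted keep their original numbers; assigning rownames(totals_sorted) <- NULL resets them if that matters.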
Using Custom Functions
Example 3: Apply a custom function to compute the range of values
    # Create a data frame
    data <- data.frame(
      group = factor(c("A", "A", "B", "B", "A", "B")),
      value = c(10, 15, 30, 35, 25, 40)
    )

    # Define a custom function to calculate range
    range_fun <- function(x) {
      return(max(x) - min(x))
    }

    # Aggregate to find the range of 'value' for each 'group'
    result <- aggregate(value ~ group, data = data, FUN = range_fun)
    print(result)

    # Output:
    #   group value
    # 1     A    15
    # 2     B    10
In this example, a custom function range_fun is used to calculate the range (difference between the maximum and minimum) of value for each group.
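A custom function does not have to be named first. As a sketch, the same calculation can be written with an anonymous function passed directly to FUN. The variable names measurements and spread are hypothetical, and diff(range(x)) is simply another way to write max(x) - min(x).

```r
# Same values as Example 3, under a hypothetical name
measurements <- data.frame(
  group = factor(c("A", "A", "B", "B", "A", "B")),
  value = c(10, 15, 30, 35, 25, 40)
)

# range(x) returns c(min, max); diff() of that pair is max - min
spread <- aggregate(
  value ~ group,
  data = measurements,
  FUN  = function(x) diff(range(x))
)
print(spread)

# Output:
#   group value
# 1     A    15
# 2     B    10
```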
Aggregating Data Frames
Example 4: Aggregating multiple columns
    # Create a data frame with multiple numeric columns
    data <- data.frame(
      group = factor(c("A", "A", "B", "B", "A", "B")),
      value1 = c(10, 20, 30, 40, 50, 60),
      value2 = c(5, 15, 25, 35, 45, 55)
    )

    # Aggregate to find the mean of 'value1' and 'value2' for each 'group'
    result <- aggregate(. ~ group, data = data, FUN = mean)
    print(result)

    # Output:
    #   group   value1   value2
    # 1     A 26.66667 21.66667
    # 2     B 43.33333 38.33333
Here, aggregate() calculates the mean for multiple columns (value1 and value2) by the group factor. The dot on the left of ~ stands for every column in data that is not named on the right-hand side.
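When the data frame also contains columns that should not be summarized, the columns to aggregate can be listed explicitly with cbind() on the left of the formula. This is a sketch under assumed data: the data frame survey and its label column are hypothetical, and the numeric values repeat those of Example 4.

```r
# Hypothetical data with a text column that should not be averaged
survey <- data.frame(
  group  = factor(c("A", "A", "B", "B", "A", "B")),
  label  = c("p", "q", "r", "s", "t", "u"),
  value1 = c(10, 20, 30, 40, 50, 60),
  value2 = c(5, 15, 25, 35, 45, 55)
)

# cbind() names exactly the columns to aggregate; 'label' is left out
means <- aggregate(cbind(value1, value2) ~ group, data = survey, FUN = mean)
print(means)
```

Listing the columns avoids applying mean() to the label column, which a dot formula would attempt.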
Key Points to Remember
- Grouping Variables: In the x/by form, the by argument specifies the grouping variables; in the formula form used in the examples, they appear on the right-hand side of ~. You can group by one or more factors.
- Aggregation Function: The FUN argument determines which summary statistic is computed. It can be any function that takes a vector of values and returns a summary, typically a single value (e.g., mean, sum, median, or a custom function).
- Data Frames and Vectors: The aggregate() function can handle both data frames (where multiple columns can be aggregated) and numeric vectors (where only one column is aggregated).
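The ... argument from the basic syntax is how extra options reach FUN. As a minimal sketch, the call below forwards na.rm = TRUE to mean() through the x/by form. The data frame readings and the result names are hypothetical, and the missing value is placed there on purpose.

```r
# Hypothetical data with one missing value in group A
readings <- data.frame(
  group = c("A", "A", "B", "B"),
  value = c(10, NA, 30, 40)
)

# Without na.rm, the mean of group A is NA
plain <- aggregate(
  x   = readings["value"],
  by  = list(group = readings$group),
  FUN = mean
)

# na.rm = TRUE is not an argument of aggregate(); it is passed on to mean()
cleaned <- aggregate(
  x     = readings["value"],
  by    = list(group = readings$group),
  FUN   = mean,
  na.rm = TRUE
)
print(cleaned)

# Output:
#   group value
# 1     A    10
# 2     B    35
```

The formula interface handles missing values differently: by default its na.action argument drops incomplete rows before FUN is called.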
Summary
The aggregate() function in R summarizes data based on grouping factors. It lets you compute summary statistics, such as means, sums, or the result of a custom function, for each level or combination of levels of those factors. By using aggregate(), you can analyze and interpret complex datasets by breaking them down into manageable groupings.