The MEDIAN() function
The MEDIAN() function is used to find the median value of a dataset, which is the middle value when the data is sorted in ascending or descending order. Unlike the average, which can be influenced by extreme values, the median provides a measure of central tendency that is less affected by outliers.
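To make the outlier claim concrete, here is a minimal Python sketch comparing the mean and the median of a small sample. The salary figures are hypothetical and chosen only for illustration; the last value is a deliberate outlier.

```python
from statistics import mean, median

# Hypothetical salary sample; the last value is a deliberate outlier.
salaries = [42_000, 45_000, 47_000, 50_000, 52_000, 1_000_000]

sample_mean = mean(salaries)      # pulled far upward by the single outlier
sample_median = median(salaries)  # average of the two middle values

print("mean:  ", sample_mean)
print("median:", sample_median)
```

In this sample the mean lands well above every ordinary salary, while the median stays between the two middle values.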
Overview of MEDIAN()
- Median: The median is the value separating the higher half from the lower half of a dataset. For a dataset with an odd number of values, the median is the middle value. For an even number of values, it is the average of the two middle values.
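The odd and even cases described above can be sketched in a few lines of Python. The function name median_of is a hypothetical helper used only for this illustration, not part of any database API.

```python
def median_of(values):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: the average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_result = median_of([7, 1, 5])      # sorted: 1, 5, 7
even_result = median_of([7, 1, 5, 3])  # sorted: 1, 3, 5, 7
print(odd_result, even_result)
```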
Syntax
MEDIAN() is not part of the SQL standard, and support varies among database systems. Some systems provide it directly, while others require a workaround. Here’s how it can be approached in various systems:
PostgreSQL
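PostgreSQL doesn’t have a built-in MEDIAN() function. The ordered-set aggregate percentile_cont(0.5) WITHIN GROUP (ORDER BY value) returns the median directly; alternatively, you can calculate it using window functions and a common table expression (CTE):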
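WITH ordered_data AS ( SELECT value, row_number() OVER (ORDER BY value) AS row_num, count(*) OVER () AS total_count FROM table_name ) SELECT AVG(value) AS median FROM ordered_data WHERE row_num IN ((total_count + 1) / 2, (total_count + 2) / 2);
Explanation:
- row_number() assigns a unique number to each row based on the ordered values.
- count(*) OVER () gets the total number of rows.
- The WHERE clause keeps row (total_count + 1) / 2 and row (total_count + 2) / 2 under integer division: the same row when the count is odd, the two middle rows when it is even.
- AVG(value) averages the selected row(s) to produce the median.
- table_name is a placeholder; TABLE itself is a reserved word and cannot be used as an unquoted table name.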
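The row-number arithmetic can be checked outside the database. The Python sketch below assumes the middle rows are numbered (n + 1) / 2 and (n + 2) / 2 under integer division, with 1-based row numbers as row_number() produces. The helper names middle_row_numbers and median_by_row_number are hypothetical and exist only for this illustration.

```python
def middle_row_numbers(total_count):
    """1-based row numbers holding the middle value(s) of total_count sorted rows."""
    lower = (total_count + 1) // 2  # integer division, like SQL integer arithmetic
    upper = (total_count + 2) // 2
    return lower, upper


def median_by_row_number(values):
    """Mimic the query: number the sorted rows, keep the middle ones, average them."""
    ordered = sorted(values)
    lower, upper = middle_row_numbers(len(ordered))
    # A set collapses the two row numbers into one when the count is odd,
    # just as IN (3, 3) matches a single row.
    picked = sorted({lower, upper})
    middle = [ordered[row - 1] for row in picked]  # row numbers are 1-based
    return sum(middle) / len(middle)


print(middle_row_numbers(5), median_by_row_number([10, 20, 30, 40, 50]))
print(middle_row_numbers(4), median_by_row_number([10, 20, 30, 40]))
```

With five rows both expressions select row 3; with four rows they select rows 2 and 3, whose average is the median.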
Oracle
Oracle supports the MEDIAN() function directly:
SELECT MEDIAN(column_name) AS median_value FROM table_name;
Explanation:
MEDIAN(column_name) calculates the median of the values in the specified column.
MySQL
MySQL does not have a built-in MEDIAN() function. In MySQL 8.0 and later, which support window functions and CTEs, you can calculate the median with an approach similar to the PostgreSQL one.
Example for MySQL:
WITH ordered_data AS ( SELECT value, ROW_NUMBER() OVER (ORDER BY value) AS row_num, COUNT(*) OVER () AS total_count FROM table_name ) SELECT AVG(value) AS median FROM ordered_data WHERE row_num IN ((total_count + 1) DIV 2, (total_count + 2) DIV 2);
Explanation:
- ROW_NUMBER() and COUNT(*) OVER () number the sorted rows and count them, as in the PostgreSQL query.
- DIV performs integer division; the / operator in MySQL returns a fractional result even for integer operands.
- The WHERE clause keeps the middle row for an odd count and the two middle rows for an even count, and AVG(value) averages them.
SQL Server
SQL Server does not have a built-in MEDIAN() function, but you can use common table expressions (CTEs) to compute the median:
WITH OrderedValues AS ( SELECT value, ROW_NUMBER() OVER (ORDER BY value) AS RowAsc, COUNT(*) OVER () AS TotalCount FROM table_name ) SELECT AVG(1.0 * value) AS median FROM OrderedValues WHERE RowAsc IN ((TotalCount + 1) / 2, (TotalCount + 2) / 2);
Explanation:
- ROW_NUMBER() assigns a row number in ascending order of value.
- COUNT(*) OVER () calculates the total number of rows.
- The WHERE clause keeps the middle row for an odd count and the two middle rows for an even count, relying on integer division.
- AVG(1.0 * value) computes the median by averaging the middle values; multiplying by 1.0 avoids the truncated integer result that AVG returns for an integer column.
Examples
Example 1: Calculating the Median of Salaries
Assume you have a table salaries with the column salary:
PostgreSQL Example:
WITH ordered_salaries AS ( SELECT salary, row_number() OVER (ORDER BY salary) AS row_num, count(*) OVER () AS total_count FROM salaries ) SELECT AVG(salary) AS median_salary FROM ordered_salaries WHERE row_num IN ((total_count + 1) / 2, (total_count + 2) / 2);
Oracle Example:
SELECT MEDIAN(salary) AS median_salary FROM salaries;
Example 2: Median for a Group of Data
To calculate the median salary for each department:
Oracle Example:
SELECT department_id, MEDIAN(salary) AS median_salary FROM employees GROUP BY department_id;
PostgreSQL Example:
WITH ordered_salaries AS ( SELECT department_id, salary, row_number() OVER (PARTITION BY department_id ORDER BY salary) AS row_num, count(*) OVER (PARTITION BY department_id) AS total_count FROM employees ) SELECT department_id, AVG(salary) AS median_salary FROM ordered_salaries WHERE row_num IN ((total_count + 1) / 2, (total_count + 2) / 2) GROUP BY department_id;
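A per-group median query of this shape can be tried on a small sample without a database server. The sketch below is one way to do that, using Python's built-in sqlite3 module with an in-memory database. It assumes an SQLite build with window-function support (3.25 or later) and that the middle rows are (total_count + 1) / 2 and (total_count + 2) / 2 under integer division. The sample rows are made up for illustration.

```python
import sqlite3

# Hypothetical sample: department 10 has an odd row count, department 20 an even one.
rows = [
    (10, 3000), (10, 4000), (10, 5000),
    (20, 2000), (20, 3000), (20, 6000), (20, 9000),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (department_id INTEGER, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?)", rows)

query = """
WITH ordered_salaries AS (
    SELECT department_id,
           salary,
           ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary) AS row_num,
           COUNT(*)     OVER (PARTITION BY department_id)                 AS total_count
    FROM employees
)
SELECT department_id, AVG(salary) AS median_salary
FROM ordered_salaries
WHERE row_num IN ((total_count + 1) / 2, (total_count + 2) / 2)
GROUP BY department_id
ORDER BY department_id
"""

# Map each department_id to its median salary.
sql_medians = dict(conn.execute(query).fetchall())
conn.close()

for department_id, median_salary in sql_medians.items():
    print(department_id, median_salary)
```

Department 10 keeps only its middle row, while department 20 averages its second and third rows.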
Key Points
- Column Data Types: Where MEDIAN() is supported (for example, in Oracle), it can be used with numeric columns or date columns. It’s not directly applicable to text columns.
- Handling Even and Odd Rows: For an odd number of rows, MEDIAN() returns the middle value. For an even number of rows, it returns the average of the two middle values.
- Performance: Computing a median requires ordering the values, so it can be resource-intensive on large datasets or in complex queries. An index on the column being ordered may help, depending on the database and the query plan.
Conclusion
The median describes the central tendency of your data in a way that is less affected by outliers than the average. MEDIAN() is not universally supported across SQL systems, but window functions and CTEs can achieve the same result.