Is there a better way to calculate the median (not average)


Suppose I have the following table definition:

CREATE TABLE x (i serial primary key, value integer not null);

I want to calculate the MEDIAN of value (not the AVG). The median is the value that divides a sorted set into two subsets containing the same number of elements. If the number of elements is even, the median is the average of the largest value in the lower half and the smallest value in the upper half. (See Wikipedia for more details.)

Here is how I currently calculate the median, but I guess there must be a better way:

SELECT AVG(values_around_median) AS median
  FROM (
    SELECT
       DISTINCT(CASE WHEN FIRST_VALUE(above) OVER w2 THEN MIN(value) OVER w3 ELSE MAX(value) OVER w2 END)
        AS values_around_median
      FROM (
        SELECT LAST_VALUE(value) OVER w AS value,
               SUM(COUNT(*)) OVER w > (SELECT count(*)/2 FROM x) AS above
          FROM x
          GROUP BY value
          WINDOW w AS (ORDER BY value)
          ORDER BY value
        ) AS find_if_values_are_above_or_below_median
      WINDOW w2 AS (PARTITION BY above ORDER BY value DESC),
             w3 AS (PARTITION BY above ORDER BY value ASC)
    ) AS find_values_around_median

Any ideas?

7 answers

#1


14  

Indeed there IS an easier way. In Postgres you can define your own aggregate functions. I posted functions to do median as well as mode and range to the PostgreSQL snippets library a while back.

http://wiki.postgresql.org/wiki/Aggregate_Median
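
As a rough illustration of the idea (a sketch only, not necessarily the code on the wiki page), a user-defined median aggregate could collect its inputs into an array and compute the median in a final function. The names median_sketch and median_sketch_final are made up for this example:

```sql
-- Final function: receives all collected values and returns their median.
-- (Hypothetical helper name, chosen for this sketch.)
CREATE OR REPLACE FUNCTION median_sketch_final(numeric[])
RETURNS numeric AS
$$
    SELECT avg(val)
    FROM  (
        SELECT val,
               row_number() OVER (ORDER BY val) AS rn,  -- position in sorted order
               count(*)     OVER ()             AS ct   -- number of non-NULL values
        FROM   unnest($1) AS val
        WHERE  val IS NOT NULL
    ) AS ranked
    -- Odd ct: both expressions name the middle row.
    -- Even ct: they name the two rows around the middle.
    WHERE  rn IN ((ct + 1) / 2, (ct + 2) / 2);
$$
LANGUAGE sql IMMUTABLE;

-- Aggregate: append each input value to an array, then call the final function.
CREATE AGGREGATE median_sketch(numeric) (
    SFUNC     = array_append,
    STYPE     = numeric[],
    FINALFUNC = median_sketch_final,
    INITCOND  = '{}'
);
```

With that in place, SELECT median_sketch(value) FROM x; should return the median of the table from the question. Collecting every input into an array keeps the sketch short, but it is unlikely to be fast on large tables.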

#2


21  

Yes. As of PostgreSQL 9.4, you can use the inverse distribution function PERCENTILE_CONT(), an ordered-set aggregate function that is also specified in the SQL standard.

WITH t(value) AS (
  SELECT 1   UNION ALL
  SELECT 2   UNION ALL
  SELECT 100 
)
SELECT
  percentile_cont(0.5) WITHIN GROUP (ORDER BY value)
FROM
  t;

This emulation of MEDIAN() via PERCENTILE_CONT() is also documented here.
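
Applied to the table from the question, the same call might look like the sketch below. percentile_disc() is the discrete counterpart, which returns an actual value from the set instead of interpolating between two neighbours; the column aliases are only illustrative:

```sql
-- Requires PostgreSQL 9.4 or later (ordered-set aggregates).
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY value) AS median,       -- interpolated
       percentile_disc(0.5) WITHIN GROUP (ORDER BY value) AS median_disc   -- existing value
FROM   x;
```

For an odd number of rows the two columns should agree; for an even number, percentile_cont() averages the two middle values, which matches the definition in the question.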

#3


7  

A simpler query for that:

WITH y AS (
   SELECT value, row_number() OVER (ORDER BY value) AS rn
   FROM   x
   WHERE  value IS NOT NULL
   )
, c AS (SELECT count(*) AS ct FROM y) 
SELECT CASE WHEN c.ct%2 = 0 THEN
          round((SELECT avg(value) FROM y WHERE y.rn IN (c.ct/2, c.ct/2+1)), 3)
       ELSE
                (SELECT     value  FROM y WHERE y.rn = (c.ct+1)/2)
       END AS median
FROM   c;

Major points

  • Ignores NULL values.
  • The core feature is the row_number() window function, which has been available since PostgreSQL 8.4.
  • The final SELECT returns the single middle row for an odd number of rows and the avg() of the two middle rows for an even number. The result is numeric, rounded to 3 decimal places.

A test shows that this query is 4x faster than the query in the question and, unlike it, yields correct results:

CREATE TEMP TABLE x (value int);
INSERT INTO x SELECT generate_series(1,10000);
INSERT INTO x VALUES (NULL),(NULL),(NULL),(3);

#4


0  

For googlers: there is also the quantile extension at http://pgxn.org/dist/quantile. Once it is installed, the median can be calculated in one line.

#5


0  

Simple SQL using only native Postgres functions:

select 
    case count(*)%2
        when 1 then (array_agg(num order by num))[count(*)/2+1]
        else ((array_agg(num order by num))[count(*)/2]::double precision + (array_agg(num order by num))[count(*)/2+1])/2
    end as median
from unnest(array[5,17,83,27,28]) num;

Of course, you can add coalesce() or similar if you want to handle NULLs.
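
One way to deal with NULLs, sketched here as an assumption about what "handling" should mean, is to drop them before aggregating so that count(*) and the array positions stay in step. The sample array is the one from above with a NULL added:

```sql
SELECT CASE count(*) % 2
           WHEN 1 THEN (array_agg(num ORDER BY num))[count(*) / 2 + 1]
           ELSE ((array_agg(num ORDER BY num))[count(*) / 2]::double precision
                 + (array_agg(num ORDER BY num))[count(*) / 2 + 1]) / 2
       END AS median
FROM   unnest(ARRAY[5, 17, NULL, 83, 27, 28]) AS num
WHERE  num IS NOT NULL;   -- drop NULLs so they are neither counted nor sorted last
-- The five remaining values sort to 5, 17, 27, 28, 83, so the median is 27.
```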

#6


0  

CREATE TABLE array_table (id integer, values integer[]) ;

INSERT INTO array_table VALUES ( 1,'{1,2,3}');
INSERT INTO array_table VALUES ( 2,'{4,5,6,7}');

select id, values, cardinality(values) as array_length,
(case when cardinality(values)%2=0 and cardinality(values)>1 then (values[(cardinality(values)/2)]+ values[((cardinality(values)/2)+1)])/2::float 
 else values[(cardinality(values)+1)/2]::float end) as median  
 from array_table

Or you can create a function and use it anywhere in your later queries.

CREATE OR REPLACE FUNCTION median (a integer[]) 
RETURNS float AS    $median$ 
Declare     
    abc float; 
BEGIN    
    SELECT (case when cardinality(a)%2=0 and cardinality(a)>1 then 
           (a[(cardinality(a)/2)] + a[((cardinality(a)/2)+1)])/2::float   
           else a[(cardinality(a)+1)/2]::float end) into abc;    
    RETURN abc; 
END;    
$median$ 
LANGUAGE plpgsql;

select id,values,median(values) from array_table
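
Note that the index arithmetic above picks the middle positions of the array as stored, so it only gives the median when each array is already sorted. For unsorted arrays, one possible variant (a sketch that assumes PostgreSQL 9.4 or later) expands each array and lets percentile_cont() do the ordering:

```sql
-- "values" is quoted here because VALUES is also an SQL keyword.
SELECT t.id,
       t."values",
       (SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY v)
        FROM   unnest(t."values") AS v) AS median   -- sorts the elements itself
FROM   array_table AS t;
```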

#7


0  

Use the function below to find the nth percentile:

CREATE or REPLACE FUNCTION nth_percentil(anyarray, int)
    RETURNS 
        anyelement as 
    $$
        SELECT $1[$2/100.0 * array_upper($1,1) + 1] ;
    $$ 
LANGUAGE SQL IMMUTABLE STRICT;

In your case, that is the 50th percentile.

Use the query below to get the median:

SELECT nth_percentil(ARRAY (SELECT Field_name FROM table_name ORDER BY 1),50)

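This gives you the 50th percentile, which is essentially the median. Note that the function returns a single array element, so for an even number of rows it does not average the two middle values.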

Hope this is helpful.
