Using dplyr window functions to calculate percentiles

2018-06-08 04:27:13

I have a working solution but am looking for a cleaner, more readable solution that perhaps takes advantage of some of the newer dplyr window functions.

Using the mtcars dataset, if I want to look at the 25th, 50th, 75th percentiles and the mean and count of miles per gallon ("mpg") by the number of cylinders ("cyl"), I use the following code:

library(dplyr)
library(tidyr)

# load data
data("mtcars")

# Percentiles used in calculation
p <- c(.25,.5,.75)

# old dplyr solution 
mtcars %>% group_by(cyl) %>% 
  do(data.frame(p=p, stats=quantile(.$mpg, probs=p), 
                n = length(.$mpg), avg = mean(.$mpg))) %>%
  spread(p, stats) %>%
  select(1, 4:6, 3, 2)

# note: the select and spread statements are just to get the data into
#       the format in which I'd like to see it, but are not critical

Is there a way I can do this more cleanly with dplyr using some of the window functions (ntile, percent_rank, etc.)? By cleanly, I mean without the "do" statement.

Thank you

If you're up for using purrr::map, you can do it like this!

library(dplyr)
library(tidyr)
library(broom)
library(purrr)

mtcars %>%
  nest(-cyl) %>%
  mutate(Quantiles = map(data, ~ quantile(.$mpg))) %>% 
  unnest(map(Quantiles, tidy))

#> # A tibble: 15 × 3
#>      cyl names     x
#>    <dbl> <chr> <dbl>
#> 1      6    0% 17.80
#> 2      6   25% 18.65
#> 3      6   50% 19.70
#> 4      6   75% 21.00
#> 5      6  100% 21.40
#> 6      4    0% 21.40
#> 7      4   25% 22.80
#> 8      4   50% 26.00
#> 9      4   75% 30.40
#> 10     4  100% 33.90
#> 11     8    0% 10.40
#> 12     8   25% 14.40
#> 13     8   50% 15.20
#> 14     8   75% 16.25
#> 15     8  100% 19.20

One nice thing about this approach is that the output is tidy, with one observation per row.
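For readers on a newer tidyr, the same nest-and-map idea can be sketched without broom. This is not the code from the answer above: it assumes tidyr >= 1.0.0 (for `nest(data = ...)` and `unnest_longer()`), and the names `quantiles_long`, `quantile` and `mpg` for the result and its columns are chosen here for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Assumption: tidyr >= 1.0.0, which provides nest(data = ...) and unnest_longer().
quantiles_long <- mtcars %>%
  nest(data = -cyl) %>%                                  # one row per cyl; other columns go into `data`
  mutate(Quantiles = map(data, ~ quantile(.x$mpg))) %>%  # a named numeric vector per group
  select(-data) %>%
  # Spread each named vector over rows: values go to `mpg`, names to `quantile`
  unnest_longer(Quantiles, values_to = "mpg", indices_to = "quantile")

quantiles_long
```

The result should again have one row per cylinder group and quantile level, with the level names ("0%", "25%", ...) kept in their own column.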

Here's a dplyr approach that avoids do but requires a separate call to quantile for each quantile value.

mtcars %>% group_by(cyl) %>%
  summarise(`25%`=quantile(mpg, probs=0.25),
            `50%`=quantile(mpg, probs=0.5),
            `75%`=quantile(mpg, probs=0.75),
            avg=mean(mpg),
            n=n())

  cyl   25%  50%   75%      avg  n
1   4 22.80 26.0 30.40 26.66364 11
2   6 18.65 19.7 21.00 19.74286  7
3   8 14.40 15.2 16.25 15.10000 14

It would be better if summarise could return multiple values with a single call to quantile, but this appears to be an open issue in dplyr development.
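As a hedged sketch of what a single call to quantile can look like in later releases: assuming dplyr >= 1.0.0 (where summarise unpacks an unnamed data frame result into columns) and tibble >= 3.0.0 (for `as_tibble_row()`), something like the following should produce the same wide table. The name `mpg_summary` is chosen here for illustration and is not part of the original answer.

```r
library(dplyr)
library(tibble)

p <- c(0.25, 0.5, 0.75)

# Assumption: dplyr >= 1.0.0 and tibble >= 3.0.0.
mpg_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    # An unnamed one-row tibble is unpacked into the columns `25%`, `50%`, `75%`
    as_tibble_row(quantile(mpg, probs = p)),
    avg = mean(mpg),
    n = n()
  )

mpg_summary
```

With this form, changing the vector `p` is enough to change which percentile columns are produced.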

UPDATE: Here's a variation on @JuliaSilge's answer that uses nesting to get the quantiles, but without the use of map. It does, however, require an extra line of code to add a column listing the quantile levels, as I'm not sure how (or if it's possible) to capture the names of the quantiles into a separate column directly from the call to quantile.

p = c(0.25,0.5,0.75)

mtcars %>% 
  group_by(cyl) %>% 
  summarise(quantiles = list(sprintf("%1.0f%%", p*100)),
            mpg = list(quantile(mpg, p))) %>% 
  unnest

    cyl quantiles   mpg
1     4       25% 22.80
2     4       50% 26.00
3     4       75% 30.40
4     6       25% 18.65
5     6       50% 19.70
6     6       75% 21.00
7     8       25% 14.40
8     8       50% 15.20
9     8       75% 16.25
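One possible way to capture the quantile names directly, sketched here under the assumption that the tibble package is available: `tibble::enframe()` turns a named vector into a two-column data frame, so the names that quantile attaches to its result become a column of their own. The names `q` and `quantiles_named` are illustrative, not from the answer above.

```r
library(dplyr)
library(tidyr)
library(tibble)

p <- c(0.25, 0.5, 0.75)

quantiles_named <- mtcars %>%
  group_by(cyl) %>%
  # enframe() converts the named quantile vector into a (quantiles, mpg) data frame
  summarise(q = list(enframe(quantile(mpg, p), name = "quantiles", value = "mpg"))) %>%
  unnest(q)

quantiles_named
```

This removes the need for the separate sprintf line, at the cost of one more package dependency.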

This is a dplyr approach that uses the tidy() function of the broom package. Unfortunately, it still requires do(), but it is a lot simpler.

library(dplyr)
library(broom)

mtcars %>%
    group_by(cyl) %>%
    do( tidy(t(quantile(.$mpg))) )

which gives:

    cyl   X0.  X25.  X50.  X75. X100.
  (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1     4  21.4 22.80  26.0 30.40  33.9
2     6  17.8 18.65  19.7 21.00  21.4
3     8  10.4 14.40  15.2 16.25  19.2

Note the use of t() since the broom package does not have a method for named numerics.
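For comparison, a similar wide result can be sketched without do() or broom. This is an alternative, not the method of the answer above, and it assumes dplyr >= 0.8.1 (for `group_modify()`) and tibble >= 3.0.0 (for `as_tibble_row()`); the name `quantiles_wide` is illustrative.

```r
library(dplyr)
library(tibble)

# Assumption: dplyr >= 0.8.1 and tibble >= 3.0.0.
quantiles_wide <- mtcars %>%
  group_by(cyl) %>%
  # For each group, turn the named quantile vector into a one-row tibble
  group_modify(~ as_tibble_row(quantile(.x$mpg))) %>%
  ungroup()

quantiles_wide
```

Here the columns keep the names that quantile produces ("0%", "25%", ...) rather than the X0., X25. forms shown above.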

This is based on my earlier answer for summary() here.

Source: http://www.djcxy.com/p/24838.html
