Read multiple Excel spreadsheets into R using readxl and correct variable types

2018-06-13 10:28:04

I have several Excel files that I am trying to read into R using the readxl package. Each file consists of several tabs, each with 60000 rows and four columns of variables. The first column is a simple integer count of seconds (0, 1, 2, etc.). The second column is a colon-separated ( : ) time in HH:MM:SS. The third column is a forward-slash-separated ( / ) date in MM/DD/YYYY. The fourth column is a floating-point decimal (e.g. 338.6).

Using the following code I get four columns and some of the formatting is consistent, but some data appears to be misinterpreted as dates or decimal numbers instead of integers, time, or date.

    > library(readxl)
    > data1 <- lapply(excel_sheets("./file_name.xls"),
                      read_excel, path = "./file_name.xls",
                      col_names = FALSE)
    > head(data1[[1]])
          X1       X2         X3    X4
    1 502342 02:12:50 02/04/2015 338.6
    2 502341 02:12:49 02/04/2015 338.1
    3 502340 02:12:48 02/04/2015 337.5
    4 502339 02:12:47 02/04/2015 337.6
    5 502338 02:12:46 02/04/2015 337.5
    6 502337 02:12:45 02/04/2015 338.0

    > head(data1[[2]])
            X1       X2     X3       X4
    1   483664 08:56:48 488774 08:52:22
    2 08:49:32 08:56:47 488774 08:52:22
    3    185.2 08:56:46 488774   485475
    4   483663 08:56:45 488774 08:52:22
    5 08:49:31 08:56:44 488774 08:52:22
    6   483662 08:56:43 488774   485475
    > class(data1[[2]]$X1)
    [1] "character"
    > mode(data1[[2]]$X1)
    [1] "character"

    > tail(data1[[1]])
                X1       X2     X3       X4
    59995 08:49:35 08:56:54 488774 08:52:22
    59996   483666 08:56:53 488774   485475
    59997 08:49:34 08:56:52 488774 08:51:50
    59998    185.3 08:56:51 488774 08:51:50
    59999   483665 08:56:50 488774   485475
    60000 08:49:33 08:56:49 488774   485475
    > tail(data1[[2]])
                X1       X2     X3     X4
    59995 09:29:17   497592 488774 488206
    59996   485927   497591 488774 488206
    59997 09:29:16   497590 488774 488206
    59998   485926    363.0 488774 488206
    59999 09:29:15 12:49:37 488774 488206
    60000   485925   497588 488774 488206

I also tried using col_types to define the column types, but this returns data frames full of NAs.

    > data1 <- lapply(excel_sheets("./file_name.xls"),
                      read_excel, path = "./file_name.xls",
                      col_names = FALSE,
                      col_types = c("numeric", "numeric", "date","numeric"))
    There were 50 or more warnings (use warnings() to see the first 50)

    > head(data1[[1]])
      X1 X2   X3 X4
    1 NA NA <NA> NA
    2 NA NA <NA> NA
    3 NA NA <NA> NA
    4 NA NA <NA> NA
    5 NA NA <NA> NA
    6 NA NA <NA> NA

Using lapply() with read_excel() returns a list of data frames. I'm not sure whether I should try to change the variable types, or how exactly to do so. The Excel files themselves look consistent in terms of variable types. I even checked row 59998 of data1[[2]], which shows 363.0 for X2, but the value should be 03:42:51.

Should I try to format these data in Excel, or change them in R? Everything currently appears to be of class character. What would be the most effective way to change the variable types in R?
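To make the question concrete, here is a minimal sketch of the kind of conversion I have in mind. It uses only base R and assumes every column arrives as character strings in the formats shown for the first sheet. The small `sheet` data frame is a hypothetical stand-in for one element of data1, and `convert_sheet()` is an illustrative helper of my own, not part of readxl. It would not fix rows where the values themselves are scrambled, as in the second sheet.

```r
# Hypothetical stand-in for one sheet read as character columns;
# the values are copied from head(data1[[1]]) above.
sheet <- data.frame(
  X1 = c("502342", "502341"),
  X2 = c("02:12:50", "02:12:49"),
  X3 = c("02/04/2015", "02/04/2015"),
  X4 = c("338.6", "338.1"),
  stringsAsFactors = FALSE
)

# Illustrative helper (not a readxl function): convert each column
# from character to the type it is meant to hold.
convert_sheet <- function(df) {
  data.frame(
    # integer count of seconds
    seconds = as.integer(df$X1),
    # combine the MM/DD/YYYY date and HH:MM:SS time into one date-time;
    # UTC is an assumption, since the files carry no time zone
    timestamp = as.POSIXct(paste(df$X3, df$X2),
                           format = "%m/%d/%Y %H:%M:%S", tz = "UTC"),
    # floating-point measurement
    value = as.numeric(df$X4)
  )
}

converted <- convert_sheet(sheet)
str(converted)
```

If something like this is a reasonable approach, it could presumably be applied to every sheet with lapply(data1, convert_sheet).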

Thanks for your help.
