Cleaning mixed decimal separators after Excel import (gsub maybe?)

I needed to read several Excel files and used the gdata package. Unfortunately, the files were formatted inconsistently: some use "," as the decimal or thousands separator, some use ".", and some use no thousands separator at all.

To give you an idea, the numbers can look like this:

# Five times 1000.1 and four times 1000.0
x <- c("1,000.1","1.000.1","1000.1","1000,1","1.000,1","1000","1,000","1.000","1000.0")
x

Is there a general way to convert these into 1000.1 and 1000.0 respectively? I thought about using gsub() and a regexp.

A first gsub() would replace every "," with ".". A second gsub() could then use a regexp that deletes every "." followed by three digits and keeps the other "." characters.

However, I'm not familiar with regexps and don't know how to do that. Can anybody help? Is there a simpler way to clean Excel sheets?

Thanks!


Using gsub for example:

 as.numeric(gsub('([0-9])[,|.]?([0-9]{3})[,|.]?','\\1\\2.',x))
[1] 1000.1 1000.1 1000.1 1000.1 1000.1 1000.0 1000.0 1000.0 1000.0

For this specific case you can even simplify the regular expression to:

 as.numeric(gsub('^(1)[,|.]?(0{3})[,|.]?','\\1\\2.',x))

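And here is a breakdown of the pattern in the last regular expression:

 ^          | (1) | [,|.]?                  | (0{3})  | [,|.]?
 begin with | 1   | optional comma or point | 3 zeros | optional comma or point

The two parenthesized groups are captured so that the replacement can reuse them. Whatever follows the match (0, 1 or nothing) is not part of the pattern and is left untouched. Note that inside the brackets "|" is a literal character, so [,.] would be enough.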
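The two-step idea from the question can also be written out directly. This is only a minimal sketch under one assumption: any "." followed by exactly three digits is a thousands separator. The function name clean_number is hypothetical and not part of gdata or base R.

```r
# Hypothetical helper: treat every "," as ".", then drop each "." that is
# followed by exactly three digits (assumed to be a thousands separator).
clean_number <- function(x) {
  # Step 1: unify the two separator characters
  x <- gsub(",", ".", x, fixed = TRUE)
  # Step 2: the lookahead requires perl = TRUE; it matches a "." that is
  # followed by three digits and then by a non-digit or the end of the string
  x <- gsub("\\.(?=[0-9]{3}(?![0-9]))", "", x, perl = TRUE)
  as.numeric(x)
}

x <- c("1,000.1","1.000.1","1000.1","1000,1","1.000,1","1000","1,000","1.000","1000.0")
clean_number(x)
```

Unlike the patterns above, this is not tied to four-digit integer parts, so an input such as "1,000,000.5" should be handled as well. The assumption has a cost: a value with exactly three decimal places, such as "1.234", would be read as 1234. That ambiguity is in the data itself and no regular expression can resolve it.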