R Analysis
Introduction to RStudio
RStudio: https://www.rstudio.com/products/rstudio/download/ (the free edition is enough).
Features: History, Environment, Plots, Dataset View, console (CLI), script editor, and Git integration.
The paid edition uses a browser/server (B/S) architecture, where the dataset view is rendered in the browser with JavaScript.
1. Inspecting the data
str
head
tail
dim
names
summary
class
levels: shows the levels of a factor
length
attributes
nrow
ncol
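A minimal sketch of these inspection functions, using the built-in iris data frame (any data frame works the same way):
str(iris)             # structure: column types and a preview of the values
head(iris)            # first six rows
tail(iris)            # last six rows
dim(iris)             # number of rows and columns
names(iris)           # column names
summary(iris)         # per-column summary statistics
class(iris)           # "data.frame"
levels(iris$Species)  # levels of the Species factor
length(iris)          # number of columns (a data frame is a list of columns)
attributes(iris)      # names, class, and row.names
nrow(iris)            # number of rows
ncol(iris)            # number of columns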
2. Data preprocessing
Sampling
Handling NA values
Detection
is.na
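For example, to count and locate missing values (a sketch, assuming the all_data data frame used in the examples below):
sum(is.na(all_data$Age))      # how many Age values are missing
which(is.na(all_data$Fare))   # row indices with a missing Fare
colSums(is.na(all_data))      # missing-value count per column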
Impute with the mode
# Since many passengers embarked at Southampton, we give them the value S.
all_data$Embarked[c(62, 830)] <- "S"
Impute with the median
all_data$Fare[1044] <- median(all_data$Fare, na.rm = TRUE)
Impute with fitted regression values (for a continuous variable)
# This time you give method = "anova" since you are predicting a continuous variable.
library(rpart)
predicted_age <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + family_size,
data = all_data[!is.na(all_data$Age),], method = "anova")
all_data$Age[is.na(all_data$Age)] <- predict(predicted_age, all_data[is.na(all_data$Age),])
Fit the model on the rows where the value is present, then predict it for the rows where it is missing.
Type conversion
train$Survived <- as.factor(train$Survived)
Discretization
Binary discretization
train$Child <- NA
train$Child[train$Age >= 18] <- 0
train$Child[train$Age<18] <- 1
Binned discretization (see the cut() sketch after the intervals below)
<10
[10,20)
[20,30)
[30,+)
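These bins can be built with cut(); a sketch, assuming an Age column as above (the AgeGroup name and labels are illustrative):
train$AgeGroup <- cut(train$Age,
                      breaks = c(-Inf, 10, 20, 30, Inf),
                      right = FALSE,  # left-closed intervals: [10,20), [20,30), ...
                      labels = c("<10", "[10,20)", "[20,30)", "[30,+)"))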
Adding and removing columns
test$Survived <- rep(0,418)
test$Child <- NULL
Constructing a data frame
my_solution <- data.frame(PassengerId = test$PassengerId, Survived = my_prediction)
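A common follow-up is to write this data frame out as a CSV file (a sketch; the file name is arbitrary):
write.csv(my_solution, file = "my_solution.csv", row.names = FALSE)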
Splitting into Training & Test sets
A good way to build the training and test sets is to shuffle the rows randomly first, ideally combined with a multi-fold scheme such as the cross-validation shown in the next section.
Shuffling:
n <- nrow(titanic)  # total number of rows
shuffled <- titanic[sample(n),]  # shuffle the dataset and call the result shuffled
Split with a 7:3 ratio:
train <- shuffled[1:round(0.7 * n),]
test <- shuffled[(round(0.7 * n) + 1):n,]
3. Modeling
What characterizes machine learning:
It actually involves making predictions about observations based on previous information.
The main task types are classification, clustering, and regression.
Choosing an algorithm
Cross-validation: split the shuffled data into six folds; each fold is used once as the test set while the remaining folds form the training set, and the fold accuracies are averaged.
set.seed(1)
accs <- rep(0, 6)  # initialize the vector of fold accuracies
for (i in 1:6) {
  # row indices of the i-th fold (roughly 1/6 of the data)
  indices <- (((i-1) * round((1/6)*nrow(shuffled))) + 1):
             ((i*round((1/6) * nrow(shuffled))))
  train <- shuffled[-indices,]
  test <- shuffled[indices,]
  # fit a classification tree on the training folds
  tree <- rpart(Survived ~ ., train, method = "class")
  # predict the classes of the test fold
  pred <- predict(tree, test, type = "class")
  # Assign the confusion matrix to conf
  conf <- table(test$Survived, pred)
  # accuracy of this fold: correct predictions / all predictions
  accs[i] <- sum(diag(conf)) / sum(conf)
}
# average accuracy across the six folds
mean(accs)
4. Feature engineering
- Feature engineering really boils down to the human element in machine learning.
- How much you understand the data, with your human intuition and creativity, can make the difference.
- Enter feature engineering: creatively engineering your own features by combining the different existing variables.
- In fact, feature engineering has been described as easily the most important factor in determining the success or failure of your predictive model.
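The Title and family_size variables used in the rpart formula above are themselves engineered features. A sketch of how they might be derived from the raw Titanic columns (assuming Name, SibSp, and Parch exist in all_data; the regular expression is one common approach, not necessarily the one used originally):
# Family size: the passenger plus siblings/spouses and parents/children aboard
all_data$family_size <- all_data$SibSp + all_data$Parch + 1
# Title: strip the leading "Surname, " and the first "." plus everything after it,
# leaving the honorific, e.g. "Mr", "Mrs", "Miss"
all_data$Title <- gsub("(.*, )|(\\..*)", "", all_data$Name)
all_data$Title <- as.factor(all_data$Title)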