R Analysis

Introduction to RStudio

RStudio: https://www.rstudio.com/products/rstudio/download/ — the free version is enough.

Features: History, Environment, Plot, Dataset View, CLI, Script, Git.

The paid edition uses a browser/server (B/S) architecture; its Dataset view is rendered in JavaScript.

1. Inspecting Data

str
head
tail
dim
names
summary
class
levels: show the levels of a factor
length
attributes
nrow
ncol
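As a sketch, each of these can be tried on the built-in iris data frame (iris here just stands in for whatever dataset you are inspecting):

```r
# Inspect the built-in iris data frame
str(iris)             # structure: column types plus a preview of values
head(iris, 3)         # first three rows
tail(iris, 3)         # last three rows
dim(iris)             # rows and columns: 150 5
names(iris)           # column names
summary(iris)         # per-column summaries
class(iris)           # "data.frame"
levels(iris$Species)  # factor levels: "setosa" "versicolor" "virginica"
length(iris)          # number of columns (a data frame is a list of columns)
attributes(iris)      # names, class, row.names
nrow(iris); ncol(iris)
```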

2. Data Preprocessing

Sampling

Handling NA values

Detecting NAs

is.na
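A minimal sketch of how is.na is typically used to find and count missing values (the vector is made up for illustration):

```r
x <- c(1, NA, 3, NA)
is.na(x)         # FALSE TRUE FALSE TRUE
sum(is.na(x))    # number of missing values: 2
which(is.na(x))  # positions of the NAs: 2 4
```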

Imputing with the mode

# Since many passengers embarked at Southampton, we give them the value S.
all_data$Embarked[c(62, 830)] <- "S"
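The value "S" above was read off by eye as the most frequent level. A generic sketch of mode imputation, using a made-up vector rather than the real Titanic column:

```r
# Impute missing values with the most frequent value (the mode)
embarked <- c("S", "C", "S", NA, "Q", "S", NA)
mode_value <- names(which.max(table(embarked)))  # table() ignores NAs by default
embarked[is.na(embarked)] <- mode_value
```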

Imputing with the median

all_data$Fare[1044] <- median(all_data$Fare, na.rm = TRUE)

Imputing with fitted regression values (for a continuous variable)

# This time you give method = "anova" since you are predicting a continuous variable.
library(rpart)
predicted_age <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + family_size,
                       data = all_data[!is.na(all_data$Age),], method = "anova")
all_data$Age[is.na(all_data$Age)] <- predict(predicted_age, all_data[is.na(all_data$Age),])

Fit the model on the rows where the value is present, then predict it for the rows where it is missing.

Type conversion

train$Survived <- as.factor(train$Survived)
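Note that as.factor returns a new vector rather than modifying in place, so the result must be assigned back. A small self-contained sketch (the data frame is made up for illustration):

```r
# Convert a numeric 0/1 column to a factor so models treat it as categorical
df <- data.frame(Survived = c(0, 1, 1, 0))
df$Survived <- as.factor(df$Survived)
class(df$Survived)   # "factor"
levels(df$Survived)  # "0" "1"
```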

Discretization

Binary discretization

train$Child <- 0
train$Child[train$Age < 18] <- 1

Binned discretization

<10
[10,20)
[20,30)
[30,+)
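Bins like these can be produced with cut(); right = FALSE makes the intervals left-closed, matching the notation above. A sketch on made-up ages:

```r
# Bin ages into <10, [10,20), [20,30), [30,+)
age <- c(5, 12, 25, 47)
age_bin <- cut(age, breaks = c(-Inf, 10, 20, 30, Inf), right = FALSE,
               labels = c("<10", "[10,20)", "[20,30)", "[30,+)"))
age_bin  # <10 [10,20) [20,30) [30,+)
```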

Adding and removing columns

test$Survived <- rep(0,418)
test$Child <- NULL

Constructing a data frame

my_solution <- data.frame(PassengerId = test$PassengerId, Survived = my_prediction)

Splitting into Training & Test sets

A good way to construct the training and test sets is to shuffle the data randomly, ideally combined with a multi-fold scheme (see cross-validation below).

Shuffling:

n <- nrow(titanic) # total number of rows
shuffled <- titanic[sample(n),] # shuffle the dataset, call the result shuffled

Splitting at a 7:3 ratio:

train <- shuffled[1:round(0.7 * n),]
test <- shuffled[(round(0.7 * n) + 1):n,]
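Put together on a dataset that ships with R (iris stands in for titanic here), the shuffle-and-split reads:

```r
set.seed(42)                   # for reproducibility
n <- nrow(iris)                # total number of rows: 150
shuffled <- iris[sample(n), ]  # shuffle the rows

train <- shuffled[1:round(0.7 * n), ]        # first 70%
test <- shuffled[(round(0.7 * n) + 1):n, ]   # remaining 30%
nrow(train)  # 105
nrow(test)   # 45
```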

3. Modeling

What characterizes machine learning:

It actually involves making predictions about observations based on previous information.

Classification, Clustering, Regression

Choosing an algorithm

Cross-validation

set.seed(1)
accs <- rep(0, 6) # initialize the accuracy vector
for (i in 1:6) {
  # Indices of the i-th of six equally sized folds
  indices <- (((i - 1) * round((1/6) * nrow(shuffled))) + 1):((i * round((1/6) * nrow(shuffled))))

  # Exclude the fold from the training set; use it as the test set
  train <- shuffled[-indices,]
  test <- shuffled[indices,]

  # Fit the model
  tree <- rpart(Survived ~ ., train, method = "class")

  # Predict on the held-out fold
  pred <- predict(tree, test, type = "class")

  # Assign the confusion matrix to conf
  conf <- table(test$Survived, pred)

  # Compute the accuracy for this fold
  accs[i] <- sum(diag(conf)) / sum(conf)
}

# Average accuracy across the folds
mean(accs)

4. Feature Engineering

  • Feature engineering really boils down to the human element in machine learning.
  • How much you understand the data, with your human intuition and creativity, can make the difference.
  • Enter feature engineering: creatively engineering your own features by combining the different existing variables.
  • In fact, feature engineering has been described as easily the most important factor in determining the success or failure of your predictive model.
