K-Nearest Neighbors
Complete the KNN classifier in assignment1.
Every time I rack my brains writing some C++-style code that refuses to produce a result... I sneak a peek at someone else's code... whoa, that function exists?! whoa, you can write it like that?!... emmmm, I give up.
Two-loop version
# This is directly analogous to the earlier nearest-neighbor exercise; note the syntax for slicing out a single row
dists[i, j] = np.sqrt(np.sum(np.square(self.X_train[j, :] - X[i, :])))
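For context, a minimal sketch of the whole two-loop function; the variable names (X, self.X_train, num_test, num_train) follow the assignment skeleton, but treat the exact signature as an assumption:

import numpy as np

def compute_distances_two_loops(self, X):
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            # Euclidean (L2) distance between test row i and train row j
            dists[i, j] = np.sqrt(np.sum(np.square(self.X_train[j, :] - X[i, :])))
    return dists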
Question 1
- What in the data is the cause behind the distinctly bright rows?
- What causes the columns?
Answer:
A test image that doesn't resemble any of the training images produces a distinctly bright row.
A training image that is far from every test image produces a bright column.
In short, some points are simply isolated from the rest of the data.
Prediction function
# argsort returns the indices that would sort the array
# take the first k of those indices and look up the corresponding y_train values
# bincount tallies how many times each value occurs (essentially bucket counting)
# then return the most frequent label; argmax breaks ties toward the smallest index, i.e. the smallest label
closest_y = self.y_train[np.argsort(dists[i, :])[:k]]
y_pred[i] = np.argmax(np.bincount(closest_y))
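A toy example (made-up labels, not from the assignment) of how the bincount/argmax pair breaks ties:

import numpy as np

closest_y = np.array([2, 5, 2, 5, 1])  # labels of the k = 5 nearest neighbors
counts = np.bincount(closest_y)        # array([0, 1, 2, 0, 0, 2])
print(np.argmax(counts))               # 2 -- the 2-vs-5 tie resolves toward the smaller label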
Question 2
We can also use other distance metrics, such as the L1 distance.
The performance of a Nearest Neighbor classifier that uses L1 distance will not change if (Select all that apply.):
- The data is preprocessed by subtracting the mean.
- The data is preprocessed by subtracting the mean and dividing by the standard deviation.
- The coordinate axes for the data are rotated.
- None of the above.
Your Answer:
1, 2
Your explanation:
The L1 norm changes when the coordinate axes are rotated, so option 3 is ruled out first.
For options 1 and 2: subtracting the mean is a translation, which cancels in every pairwise difference, and dividing by the (shared) standard deviation rescales all L1 distances by the same positive factor; neither changes the relative ordering of distances, so the classifier's predictions stay the same.
Later, the Frobenius norm is used to check whether two distance matrices differ.
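That check looks roughly like this (a sketch; dists is assumed to be the two-loop result and dists_one the one-loop result):

difference = np.linalg.norm(dists - dists_one, ord='fro')
print('Difference was: %f' % difference)
if difference < 0.001:
    print('Good! The distance matrices are the same')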
One-loop version
The one-loop version is little more than syntactic sugar over the two-loop one: broadcasting fills a whole row of dists at a time.
dists[i, :] = np.sqrt(np.sum(np.square(self.X_train - X[i, :]), axis=1))
No-loop version
The key idea behind the fully vectorized version is the expansion (x - y)^2 = x^2 - 2xy + y^2, applied row-wise: ||x - y||^2 = ||x||^2 - 2 x·y + ||y||^2.
This doesn't lower the asymptotic complexity, but it pushes all the work into one matrix multiplication plus two broadcast additions, which NumPy runs far faster than Python-level loops.
dists = np.multiply(np.dot(X, self.X_train.T), -2)
# keepdims preserves the (num_test, 1) shape so that the following addition broadcasts correctly
dists = np.add(dists, np.sum(np.square(X), axis=1, keepdims=True))
dists = np.add(dists, np.sum(np.square(self.X_train), axis=1))
dists = np.sqrt(dists)
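A quick sanity check of the expansion on random data (a standalone sketch, assuming nothing beyond NumPy):

import numpy as np

X = np.random.randn(3, 4)
Y = np.random.randn(5, 4)
# vectorized: ||x||^2 - 2 x·y + ||y||^2, broadcast into a 3x5 matrix
d_fast = np.sqrt(np.sum(X**2, axis=1, keepdims=True) - 2 * X.dot(Y.T) + np.sum(Y**2, axis=1))
# reference: explicit pairwise differences
d_ref = np.sqrt(((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2))
print(np.allclose(d_fast, d_ref))  # True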
Timing comparison
Two loop version took 33.835573 seconds
One loop version took 74.891847 seconds
No loop version took 0.359040 seconds
The no-loop version is clearly far faster; interestingly, the one-loop version actually comes out even slower than the two-loop one here.
Cross-validation
Split the training data into folds; each fold takes a turn as the validation set while the remaining folds are used for training, in order to tune the best hyperparameter.
Split the data into equal folds
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
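Unlike np.split, np.array_split tolerates sizes that don't divide evenly; a quick illustration (toy data, not the assignment's):

import numpy as np

np.array_split(np.arange(10), 3)
# [array([0, 1, 2, 3]), array([4, 5, 6]), array([7, 8, 9])]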
The cross-validation loop
for k in k_choices:
    accuracies = []
    for i in range(num_folds):
        # stack the remaining training folds vertically (row-wise)
        XtrainSets = np.vstack(X_train_folds[0:i] + X_train_folds[i + 1:])
        # concatenate the remaining label folds horizontally (the labels are 1-D)
        ytrainSets = np.hstack(y_train_folds[0:i] + y_train_folds[i + 1:])
        XtestSets = X_train_folds[i]
        ytestSets = y_train_folds[i]
        classifier.train(XtrainSets, ytrainSets)
        dists = classifier.compute_distances_no_loops(XtestSets)
        # these names seemed to clash with variables in the final cell, so I prefixed them all with an underscore...
        _y_test_pred = classifier.predict_labels(dists, k)
        _num_correct = np.sum(_y_test_pred == ytestSets)
        _num_test = ytestSets.shape[0]
        accuracy = float(_num_correct) / _num_test
        accuracies.append(accuracy)
    k_to_accuracies[k] = accuracies
Choosing the hyperparameter
best_k = 10
This gives a test accuracy of 28.2%.
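best_k can also be read off programmatically rather than by eyeballing the plot (a sketch, assuming the k_to_accuracies dict filled in above):

mean_accuracies = {k: np.mean(v) for k, v in k_to_accuracies.items()}
best_k = max(mean_accuracies, key=mean_accuracies.get)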
Question 3
Which of the following statements about k-Nearest Neighbor (k-NN) are true in a classification setting, and for all k? Select all that apply.
- The training error of a 1-NN will always be better than that of 5-NN.
- The test error of a 1-NN will always be better than that of a 5-NN.
- The decision boundary of the k-NN classifier is linear.
- The time needed to classify a test example with the k-NN classifier grows with the size of the training set.
- None of the above.
Your Answer:
1, 4
Your explanation:
- 1-NN always has zero training error, since every training point is its own nearest neighbor, while 5-NN may misclassify some training points; but a small training error says nothing about whether the model predicts effectively.
- Not necessarily: 1-NN is prone to overfitting and is sensitive to noise, so its test error can be worse than 5-NN's.
- The decision boundary is not linear; kNN is not a linear classifier in the first place.
- True: each test example has to be compared against every training example, so classification time grows with the size of the training set.