K-Nearest Neighbors

Completing the KNN classifier in assignment1.

Every time I rack my brains writing C++-style code that then refuses to produce a result, I sneak a peek at someone else's solution... wait, that function exists? You can write it like that?... and I quietly despair.

Two-loop version

# This is analogous to the earlier nearest-neighbor classifier; note how a single row is sliced out
dists[i, j] = np.sqrt(np.sum(np.square(self.X_train[j, :] - X[i, :])))
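For context, here is roughly how that line sits inside the assignment's compute_distances_two_loops; this is only a sketch written as a standalone function rather than the classifier method:

import numpy as np

def compute_distances_two_loops(X, X_train):
    # X: (num_test, D) test images, X_train: (num_train, D) training images
    num_test, num_train = X.shape[0], X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            # L2 distance between test image i and training image j
            dists[i, j] = np.sqrt(np.sum(np.square(X_train[j, :] - X[i, :])))
    return dists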

Question 1

  • What in the data is the cause behind the distinctly bright rows?
  • What causes the columns?

Answer:
A distinctly bright row means that test image is far from (unlike) every training image.
A bright column means that training image is far from every test image.
In short, some images are simply isolated from the rest of the data.
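One way to check this numerically is to look at the mean distance per row and per column; a sketch, assuming dists is the (num_test, num_train) distance matrix computed above:

# test images with the largest mean distance to all training images -> bright rows
bright_rows = np.argsort(dists.mean(axis=1))[::-1][:5]
# training images with the largest mean distance to all test images -> bright columns
bright_cols = np.argsort(dists.mean(axis=0))[::-1][:5]
print(bright_rows, bright_cols)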

Prediction function

# argsort returns the indices that would sort the array
# take the first k of those indices and look up the corresponding y_train labels
# bincount counts how many times each label occurs
# argmax then returns the most frequent label; ties are broken in favor of the smaller label, which is what we want
closest_y = self.y_train[np.argsort(dists[i,:])[:k]]
y_pred[i] = np.argmax(np.bincount(closest_y))
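A tiny toy run of the argsort / bincount / argmax chain, with made-up distances and labels, to show what each step produces:

import numpy as np

dists_row = np.array([0.9, 0.1, 0.5, 0.3, 0.7])  # distances from one test image to 5 training images
y_train = np.array([2, 1, 1, 0, 2])              # labels of those 5 training images
k = 3

closest_y = y_train[np.argsort(dists_row)[:k]]   # labels of the 3 nearest neighbors -> [1 0 1]
counts = np.bincount(closest_y)                  # occurrences of each label -> [1 2]
print(np.argmax(counts))                         # predicted label: 1 (ties go to the smaller label)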

Question 2

We can also use other distance metrics such as L1 distance.
The performance of a Nearest Neighbor classifier that uses L1 distance will not change if (Select all that apply.):

  1. The data is preprocessed by subtracting the mean.
  2. The data is preprocessed by subtracting the mean and dividing by the standard deviation.
  3. The coordinate axes for the data are rotated.
  4. None of the above.

Your Answer:
1, 2
Your explanation:
The L1 distance is not invariant under a rotation of the coordinate axes, so option 3 is ruled out.
For options 1 and 2, subtracting the mean only translates the data, and dividing by a single standard deviation rescales all distances by the same factor, so the relative ordering of L1 distances, and hence the classifier's predictions, is unchanged.
(The notebook later uses the Frobenius norm to check whether two distance matrices differ.)
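A quick toy check of the claim for options 1 and 2; this is only a sketch with made-up data and a single global standard deviation:

import numpy as np

rng = np.random.default_rng(0)
X_train = rng.random((5, 4))
x_test = rng.random(4)

def l1(A, b):
    # L1 distance from each row of A to the vector b
    return np.sum(np.abs(A - b), axis=1)

order_raw = np.argsort(l1(X_train, x_test))

mu, sigma = X_train.mean(axis=0), X_train.std()   # subtract the mean, divide by one global std
order_norm = np.argsort(l1((X_train - mu) / sigma, (x_test - mu) / sigma))

print(np.array_equal(order_raw, order_norm))      # True: the neighbor ordering is unchanged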

One-loop version

The one-loop version is little more than syntactic sugar over the two-loop one: broadcasting replaces the inner loop over the training images, as illustrated after the snippet below.

dists[i,:] = np.sqrt(np.sum(np.square(self.X_train - X[i,:]), axis = 1))
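The single loop works because self.X_train - X[i, :] broadcasts the one test row against every training row; a toy illustration of the shapes involved:

import numpy as np

X_train = np.arange(12).reshape(4, 3)    # 4 "training images" with 3 features each
x = np.array([1, 1, 1])                  # one "test image"

diff = X_train - x                       # broadcast to shape (4, 3)
d = np.sqrt(np.sum(np.square(diff), axis=1))
print(diff.shape, d.shape)               # (4, 3) (4,)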

No-loop version

The key idea of the fully vectorized version is the expansion (x - y)^2 = x^2 - 2xy + y^2: the squared distances split into a per-test term, a per-training term, and a cross term that is a single matrix product, so the whole distance matrix can be built without any Python loops.

dists = np.multiply(np.dot(X, self.X_train.T), -2)
# keepdims keeps the summed axis as a dimension of size 1, so the result broadcasts correctly in the addition below
dists = np.add(dists, np.sum(np.square(X), axis = 1, keepdims = True))
dists = np.add(dists, np.sum(np.square(self.X_train), axis = 1))
dists = np.sqrt(dists)
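The Frobenius-norm check mentioned earlier can then confirm that the vectorized result matches the two-loop one; a sketch, where dists_two is assumed to be the matrix from the two-loop version:

difference = np.linalg.norm(dists - dists_two, ord='fro')
print('Difference was: %f' % difference)
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')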

Timing comparison

Two loop version took 33.835573 seconds
One loop version took 74.891847 seconds
No loop version took 0.359040 seconds
Clearly the fully vectorized version is much faster.
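The timings come from wrapping each implementation in a simple timer; a minimal sketch of such a helper (the notebook's own helper may differ in details):

import time

def time_function(f, *args):
    # return how many seconds the call f(*args) takes
    tic = time.time()
    f(*args)
    toc = time.time()
    return toc - tic

# e.g. time_function(classifier.compute_distances_no_loops, X_test)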

Cross-validation

Split the training data into folds, use each fold in turn as a validation set, and tune for the best hyperparameter k.

Splitting the data into equal folds

X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
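For intuition, np.array_split on a toy array (unlike np.split, it also tolerates sizes that do not divide evenly):

import numpy as np

folds = np.array_split(np.arange(10), 5)
print([f.tolist() for f in folds])   # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]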

Cross-validation loop

for k in k_choices:
    accuracies = []
    for i in range(num_folds):
        # stack the remaining training folds vertically (row-wise)
        XtrainSets = np.vstack(X_train_folds[0 : i] + X_train_folds[i + 1:])
        # concatenate the remaining label folds horizontally
        ytrainSets = np.hstack(y_train_folds[0 : i] + y_train_folds[i + 1:])
        XtestSets = X_train_folds[i]
        ytestSets = y_train_folds[i]

        classifier.train(XtrainSets, ytrainSets)
        dists = classifier.compute_distances_no_loops(XtestSets)
        # the plain names seemed to clash with variables used elsewhere in the notebook, so underscores were added to all of them
        _y_test_pred = classifier.predict_labels(dists, k)

        _num_correct = np.sum(_y_test_pred == ytestSets)
        _num_test = ytestSets.shape[0]

        accuracy = float(_num_correct) / _num_test
        accuracies.append(accuracy)
    k_to_accuracies[k] = accuracies
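One way to read off the best k afterwards is to average the fold accuracies for each k; a sketch, assuming k_to_accuracies from the loop above:

mean_accuracies = {k: np.mean(v) for k, v in k_to_accuracies.items()}
best_k = max(mean_accuracies, key=mean_accuracies.get)
print(best_k, mean_accuracies[best_k])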

Choosing the hyperparameter

best_k = 10
With this k the accuracy comes out to 28.2%.
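That accuracy comes from retraining the classifier on the full training set with the chosen k and evaluating on the test set; roughly, and assuming classifier, X_train, y_train, X_test and y_test from earlier in the notebook:

classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)

num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / y_test.shape[0]
print('Got %d / %d correct => accuracy: %f' % (num_correct, y_test.shape[0], accuracy))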

Question 3

Which of the following statements about k-Nearest Neighbor (k-NN) are true in a classification setting, and for all k? Select all that apply.

  1. The training error of a 1-NN will always be better than that of 5-NN.
  2. The test error of a 1-NN will always be better than that of a 5-NN.
  3. The decision boundary of the k-NN classifier is linear.
  4. The time needed to classify a test example with the k-NN classifier grows with the size of the training set.
  5. None of the above.

Your Answer:
1, 4
Your explanation:

  1. The training error of 1-NN is zero, since every training point is its own nearest neighbor, but its error on the test set can still be large; a small training error therefore does not guarantee useful predictions.
  2. Not necessarily: 1-NN tends to overfit and is sensitive to noise, so its test error can be worse than that of 5-NN.
  3. The decision boundary is not linear; k-NN is not a linear classifier to begin with.
  4. Every test example has to be compared against every training example, so classification time inevitably grows with the size of the training set.