Two-Layer Neural Network

Training a simple two-layer neural network.

The first layer applies a ReLU non-linearity; the second layer is a plain affine layer (a matrix multiply plus bias).
The class scores are evaluated with a softmax classifier.

Computing the loss function

Forward pass: compute the scores

h1 = np.maximum(X.dot(W1) + b1, 0) # first layer: affine + ReLU
h2 = h1.dot(W2) + b2
scores = h2
# print(scores.shape)

Forward pass: compute the loss

Exactly the same as in the softmax classifier:

maxScore = np.reshape(np.max(scores, axis = 1), (N, 1)) # shape (N, 1)
probability = np.exp(scores - maxScore) / np.sum(np.exp(scores - maxScore), axis = 1, keepdims = True)
correctClass = np.zeros_like(probability)
correctClass[np.arange(N), y] = 1.0

loss = -np.sum(correctClass * np.log(probability)) / N
loss += reg * (np.sum(W1 * W1) + np.sum(W2 * W2))
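
As a quick sanity check (my own addition, not part of the assignment text): with tiny random weights and reg = 0 every class gets a probability of roughly 1/C, so the loss should come out close to log(C). The small helper below mirrors the loss code above; the names are mine.

import numpy as np

def softmax_loss(scores, y):
    """Numerically stable softmax cross-entropy, mirroring the code above."""
    N = scores.shape[0]
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    probability = np.exp(shifted) / np.sum(np.exp(shifted), axis=1, keepdims=True)
    return -np.sum(np.log(probability[np.arange(N), y])) / N

# with near-zero scores the loss should be about log(C)
N, C = 100, 10
scores = 0.001 * np.random.randn(N, C)
y = np.random.randint(C, size=N)
print(softmax_loss(scores, y), np.log(C))  # the two values should be close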

Backward pass: compute the gradients

Backpropagation is nothing more than applying the chain rule layer by layer to compute the gradients.
At the line marked WHY below, I originally left out the factor of 2 and the gradient check showed an error of about 3e-1; adding the 2 fixed it, so my earlier version was simply wrong. The factor comes from the regularization term: the loss adds reg * sum(W*W), whose derivative with respect to W is 2 * reg * W (implementations that write the term as 0.5 * reg * sum(W*W) get a gradient of just reg * W).

dScores = probability - correctClass
dScores /= N
dW2 = np.dot(h1.T, dScores)
db2 = np.sum(dScores, axis = 0)

# ReLU is max(x, 0), so zero out the gradient wherever the forward activation was <= 0
dh1 = np.dot(dScores, W2.T)
dh1[h1 <= 0] = 0

dW1 = np.dot(X.T, dh1)
db1 = np.sum(dh1, axis = 0)

# WHY: d/dW of the regularization term reg * sum(W*W) is 2 * reg * W
dW2 += 2 * reg * W2
dW1 += 2 * reg * W1

grads["W1"] = dW1
grads["b1"] = db1
grads["W2"] = dW2
grads["b2"] = db2

Training function

Randomly sample a small minibatch of the training data.

index = np.random.choice(num_train, batch_size, replace = True)
X_batch = X[index,:]
y_batch = y[index]

Updating the weights

I originally wrote the update the first way, but was worried that the values pulled out of the dictionary might be copies, so I switched to the second version. (In fact `self.params['W1']` returns a reference to the stored ndarray, so the in-place `+=` already updates the parameters; the second version is just more explicit.)

dW1, db1 = grads["W1"], grads["b1"]
dW2, db2 = grads["W2"], grads["b2"]

W1, b1 = self.params['W1'], self.params['b1']
W2, b2 = self.params['W2'], self.params['b2']

W1 += -learning_rate * dW1
W2 += -learning_rate * dW2
b1 += -learning_rate * db1.reshape(-1) # db1 was computed with keepdims earlier, hence the reshape
b2 += -learning_rate * db2.reshape(-1)
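
For the record, the worry about copies is unfounded: indexing the dict hands back a reference to the very same ndarray, so the in-place `+=` above already updates `self.params`. A tiny standalone demonstration:

import numpy as np

params = {'W1': np.ones((2, 2))}
W1 = params['W1']            # same array object, not a copy
W1 += -0.1                   # in-place update
print(params['W1'][0, 0])    # 0.9 -> the dict entry changed too

W1 = W1 - 0.1                # rebinding creates a brand-new array...
print(params['W1'][0, 0])    # ...so this still prints 0.9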

The second version:

# the keepdims in the earlier bias-gradient code was adjusted at the same time
self.params['W2'] -= learning_rate * grads['W2']
self.params['b2'] -= learning_rate * grads['b2']
self.params['W1'] -= learning_rate * grads['W1']
self.params['b1'] -= learning_rate * grads['b1']
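
Putting the pieces together, the body of the training loop looks roughly like this (a sketch of my understanding, not the assignment's code; `iterations_per_epoch` and the per-epoch decay are assumptions that mirror the `learning_rate_decay` argument used in the tuning code below):

import numpy as np

def sgd_train(params, loss_and_grads, sample_batch, num_iters=1000,
              learning_rate=1e-3, learning_rate_decay=0.95,
              iterations_per_epoch=200):
    """Minimal SGD loop: sample a batch, backprop, update in place, decay the rate."""
    loss_history = []
    for it in range(num_iters):
        X_batch, y_batch = sample_batch()
        loss, grads = loss_and_grads(X_batch, y_batch)
        loss_history.append(loss)
        for name in params:                        # the in-place updates from above
            params[name] -= learning_rate * grads[name]
        if (it + 1) % iterations_per_epoch == 0:   # decay once per "epoch"
            learning_rate *= learning_rate_decay
    return loss_history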

Prediction function

# very simple indeed; yet another piece written from scratch
W1, b1 = self.params['W1'], self.params['b1']
W2, b2 = self.params['W2'], self.params['b2']

h1 = np.maximum(X.dot(W1) + b1, 0) # first layer: affine + ReLU
h2 = h1.dot(W2) + b2
scores = h2

y_pred = np.argmax(scores, axis = 1)

Training the network with the default setup

It turns out the accuracy is only 28.7%, which is not very high.

What’s wrong?. Looking at the visualizations above, we see that the loss is decreasing more or less linearly, which seems to suggest that the learning rate may be too low. Moreover, there is no gap between the training and validation accuracy, suggesting that the model we used has low capacity, and that we should increase its size. On the other hand, with a very large model we would expect to see more overfitting, which would manifest itself as a very large gap between the training and validation accuracy.

This tells us a few things (see the plotting sketch after the list):

  1. The loss curve decreases roughly linearly, which suggests the learning rate may be too low.
  2. There is no gap between the training and validation accuracy, which suggests the model's capacity is too small and we should increase its size. Conversely, with a very large model we would expect more overfitting, which would show up as a large gap between the training and validation accuracy.
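
For reference, a plotting sketch for this diagnosis (my own code; it assumes `net.train` returns a dictionary with the keys 'loss_history', 'train_acc_history' and 'val_acc_history', which is an assumption on my part):

import matplotlib.pyplot as plt

def plot_stats(stats):
    """Plot the loss curve and the train/val accuracy curves."""
    plt.subplot(2, 1, 1)
    plt.plot(stats['loss_history'])
    plt.title('Loss history')
    plt.xlabel('Iteration')

    plt.subplot(2, 1, 2)
    plt.plot(stats['train_acc_history'], label='train')
    plt.plot(stats['val_acc_history'], label='val')
    plt.title('Classification accuracy history')
    plt.xlabel('Epoch')
    plt.legend()
    plt.show()

# usage: plot_stats(net.train(X_train, y_train, X_val, y_val, ...))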

Hyperparameter tuning

To improve the accuracy we need to tune the hyperparameters (or use any other trick we like).
As before, we pick a number of points within each interval and test them all.

learning_rates = [1.5e-3, 2.5e-3]
regularization_strengths = [1e-5, 2e-5]
layer_sizes = [40, 50, 60]
best_val = -1.0

part = 5 # split each interval into `part` equal segments, giving part + 1 grid points
print("Rate\t\tStrength\tHiddenSize\tAccuracy\tIsbetter")
for i in range(part + 1):
    for j in range(part + 1):
        for size in layer_sizes:
            rate = i * (learning_rates[1] - learning_rates[0]) / part + learning_rates[0] # current learning rate
            strength = j * (regularization_strengths[1] - regularization_strengths[0]) / part + regularization_strengths[0] # current regularization strength
            net = TwoLayerNet(input_size, size, num_classes)
            net.train(X_train, y_train, X_val, y_val,
                num_iters=1000, batch_size=200,
                learning_rate=rate, learning_rate_decay=0.95,
                reg=strength, verbose=False)
            # accuracy on the training set
            yTrainpred = net.predict(X_train)
            trainAccuracy = np.mean(y_train == yTrainpred)
            # accuracy on the validation set
            yValpred = net.predict(X_val)
            valAccuracy = np.mean(y_val == yValpred)
            # keep the network that does best on the validation set
            if(best_val < valAccuracy):
                best_val = valAccuracy
                best_net = net
                print("%f\t%.7f\t%d\t\t%f\t**" %(rate, strength, size, valAccuracy))
            else:
                print("%f\t%.7f\t%d\t\t%f" %(rate, strength, size, valAccuracy))
print("Done")

During tuning I found a few fairly good parameter settings:

Rate      Strength   HiddenSize  ValidationAccuracy  TestAccuracy
0.002000  0.0000150  50          0.503000            0.464
0.001700  0.0000180  60          0.495000            0.479

With these settings you can now see the gap mentioned earlier.

In addition, in the weight visualization you can just about make out some contour-like shapes.
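
The visualization in question is a grid built from the first-layer weights reshaped back into images. A minimal sketch, assuming CIFAR-10 inputs (32x32x3, so W1 has shape (3072, hidden_size)); the assignment's vis_utils module has a nicer grid helper, if I remember correctly:

import numpy as np
import matplotlib.pyplot as plt

def show_first_layer_weights(W1, rows=5, cols=10):
    """Reshape each column of W1 (3072 x H) into a 32x32x3 image and tile the first rows*cols."""
    W = W1.reshape(32, 32, 3, -1).transpose(3, 0, 1, 2)   # (H, 32, 32, 3)
    for i in range(min(rows * cols, W.shape[0])):
        img = W[i]
        img = 255.0 * (img - img.min()) / (img.max() - img.min() + 1e-8)
        plt.subplot(rows, cols, i + 1)
        plt.imshow(img.astype('uint8'))
        plt.axis('off')
    plt.show()

# usage: show_first_layer_weights(best_net.params['W1'])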

Question

Now that you have trained a Neural Network classifier, you may find that your testing accuracy is much lower than the training accuracy. In what ways can we decrease this gap? Select all that apply.

  1. Train on a larger dataset.
  2. Add more hidden units.
  3. Increase the regularization strength.
  4. None of the above.

Your answer:
1, 3
Your explanation:
A larger training set and a stronger regularization penalty both improve generalization; adding more hidden units, on the other hand, increases model capacity and tends to make the overfitting gap larger.

A bug

While training with certain hyperparameter settings I ran into the following problem:
the argument of the log became 0:

Rate 0.003200 Strength 0.00003000
C:\Users\xxx\spring1718_assignment1\assignment1\cs231n\classifiers\neural_net.py:103: RuntimeWarning: divide by zero encountered in log
loss = -np.sum(correctClass * np.log(probability)) / N

Once some value turns into NaN, everything that follows stays broken:

Rate 0.004400 Strength 0.00003000
C:\Users\xxx\spring1718_assignment1\assignment1\cs231n\classifiers\neural_net.py:99: RuntimeWarning: overflow encountered in subtract
probability = np.exp(scores - maxScore) / np.sum(np.exp(scores - maxScore), axis = 1, keepdims = True)
C:\Users\xxx\spring1718_assignment1\assignment1\cs231n\classifiers\neural_net.py:99: RuntimeWarning: invalid value encountered in subtract
probability = np.exp(scores - maxScore) / np.sum(np.exp(scores - maxScore), axis = 1, keepdims = True)
C:\Users\xxx\spring1718_assignment1\assignment1\cs231n\classifiers\neural_net.py:78: RuntimeWarning: invalid value encountered in maximum
h1 = np.maximum(X.dot(W1) + b1, 0) # first layer: affine + ReLU
D:\ProgramData\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py:83: RuntimeWarning: invalid value encountered in reduce
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
C:\Users\xxx\spring1718_assignment1\assignment1\cs231n\classifiers\neural_net.py:122: RuntimeWarning: invalid value encountered in less_equal
dh1[h1 <= 0] = 0
C:\Users\xxx\spring1718_assignment1\assignment1\cs231n\classifiers\neural_net.py:257: RuntimeWarning: invalid value encountered in maximum
h1 = np.maximum(X.dot(W1) + b1, 0) # first layer: affine + ReLU

With this pair of hyperparameters, the divide-by-zero warning appears somewhere after the 100th iteration.
The direct cause is a 0 inside the log; but since the maximum score is already subtracted before exponentiating, my guess was that the maximum itself is blowing up.
Printing maxScore confirms it: by the end it has grown to around 1e33.
So my guess is that some hyperparameter settings make the network unstable, a kind of positive feedback in which the weights and scores grow towards infinity.
When this happens it seems better simply not to pick those hyperparameters, otherwise they corrupt the final results.
PS: the input data really should have been normalized.
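
Two small defensive measures I would add next time (my own sketch, not the assignment's code): clamp the probabilities before taking the log so log(0) cannot occur, and reject any hyperparameter setting whose loss stops being finite:

import numpy as np

def safe_log(probability, eps=1e-12):
    """Clamp probabilities away from 0 before the log, so log(0) = -inf can't happen."""
    return np.log(np.clip(probability, eps, 1.0))

def diverged(loss_history):
    """True if any recorded loss became NaN or inf, i.e. the run blew up."""
    return not np.all(np.isfinite(np.asarray(loss_history)))

# usage sketch inside the tuning loop (assuming net.train returns a stats dict
# with a 'loss_history' key, as in the plotting sketch above):
# stats = net.train(...)
# if diverged(stats['loss_history']):
#     continue  # skip this hyperparameter setting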