吴恩达机器学习实验（二）

Logistic回归

Logistic回归（Logistic Regression）虽然名字叫“回归”，但实际是一种最经典的二元分类算法。在分类问题中，我们的输出是一个类别，例如收到的邮件是否是垃圾邮件，肿瘤是否是阳性的等等。

Logistic回归最大的特点在于引入了sigmoid函数（又称为logistic函数）：

$f(z) = \frac{1}{1+e^{-z}}$

sigmoid函数能将输入变量展平到（0，1）的区间，我们很容易联想到概率，即划分为两个子类（1和0）的概率可以用sigmoid函数的输出值来表示。因此，我们可以拟合一条分割曲线，将输入数据喂入分割曲线，得到的数值作为sigmoid函数的自变量，即大于0代表分类为1，否则分类为0。我们采用sigmoid函数的另一大优势是sigmoid函数的导数形式很简洁。

代价函数

我们如果依旧采用平方误差函数作为代价函数，将sigmoid函数代入后发现，我们需要优化的代价函数变成了非凸的，这样不方便我们求解。因此我们选择采用信心论中常用的信息熵的方法来通过概率值（sigmoid函数输出值）衡量信息量的大小。

我们定义cost function为：

这样cost的曲线为：

很容易看出，若标签为1，当数据的sigmoid输出越接近1，cost值越小，反之，若标签为0，当数据的sigmoid输出越接近0，cost值同样越小。所以这个cost function很好的反映了数据和标签的关系。

cost function可以化简如下：

$Cost\left( h_\theta(x),y \right)=-y\times log\left( {h_\theta}\left( x \right) \right)-(1-y)\times log\left( 1-h_\theta(x) \right)$

代入代价函数为：

$J(\theta)=\frac{1}{m}\sum\limits_{i=1}^{m}[-y^{(i)}\log \left( h_\theta(x^{(i)})\right)-\left( 1-y^{(i)} \right)\log \left( 1-h_\theta( x^{(i)}) \right)]$

得到了代价函数后，我们采用梯度下降的方法来更新参数，达到最小化代价函数的目的：

求导后得到：

$\theta_j := \theta_j - \alpha \frac{1}{m}\sum\limits_{i=1}^m\left(h_\theta\left(x^{\left( i \right)} \right)-y^\left( i \right) \right)x_j^{(i)}$

注意！！！其中的$\theta$更新依然是同时更新所有的值，所以写代码时需要在这里新开辟一个列表或矩阵存储暂时的$\theta_j$

实验描述

实验一

Suppose that you are the administrator of a university department and you want to determine each applicant’s chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set for logistic regression. For each training example, you have the applicant’s scores on two exams and the admissions decision. Your task is to build a classification model that estimates an applicant’s probability of admission based the scores from those two exams.

首先我们读取数据

with open('ex2data1.txt', 'r') as f:
    data = f.readlines()

score1 = []
score2 = []
admit = []
for i in data:
    tmp = i.split(',', 3)
    score1.append(float(tmp[0]))
    score2.append(float(tmp[1]))
    admit.append(int(tmp[2].replace('\n', '')))

# m为样本数量
m = len(admit)

打印出数据散点图：

可以看出，我们大致可以用一条直线来分割数据。

构造矩阵，初始化参数

# 构造数据和标签矩阵
X = np.zeros((m, 3))
X[:, 0] = np.ones(m)
X_1 = np.array(score1)
X_1 = (X_1-np.mean(X_1))/np.std(X_1)  # 特征缩放
X_2 = np.array(score2)
X_2 = (X_2-np.mean(X_2))/np.std(X_2)  # 特征缩放
X[:, 1] = X_1
X[:, 2] = X_2
y = np.array(admit).reshape((m, 1))

# 初始化参数
theta = np.zeros((1, 3))
alpha = 0.03

编写sigmoid函数

def sigmoid(X, theta):
    """
    sigmoid函数
    :param X: m*3数据矩阵
    :param theta: 1*3参数
    :return: m*1结果
    """
    return 1/(1 + np.exp(-np.dot(X, theta.T)))

编写cost function

1
2
3

def compute_cost(X, y, theta):
    # 默认以e为底数
    return np.sum(-y*np.log(sigmoid(X, theta)) - (1-y)*np.log(1-sigmoid(X, theta)))/y.shape[0]

其中对数函数默认为自然对数。

编写梯度下降函数

def gradient_descent(X, y, theta, alpha):
    tmp = np.zeros((1, 3))
    cost = [compute_cost(X, y, theta)]
    ite = 0
    while True:
        ite += 1
        for j in range(theta.shape[1]):
            tmp[0, j] = theta[0, j] - alpha * np.sum((sigmoid(X, theta)-y)*X[:, j].reshape((y.shape[0], 1)))/y.shape[0]
        theta = tmp
        cost.append(compute_cost(X, y, theta))
        print('第{}次迭代，theta为{}，cost为{}'.format(ite, theta, cost[-1]))
        if abs(cost[-1] - cost[-2]) <= 0.0001:
            print('共用了%d次迭代收敛' % ite)
            break
    return theta, cost, ite

这次我规定当cost的变化量小于0.0001时停止迭代。

打印训练过程，并编写测试器分类准确度函数

theta, cost, ite = gradient_descent(X, y, theta, alpha)
final_cost = compute_cost(X, y, theta)
print('final_cost:', final_cost)
print('theta:', theta)

def test_classifier(X, y, theta):
    correct = 0
    diff = sigmoid(X, theta) - y
    for i in range(diff.shape[0]):
        if abs(diff[i, 0]) <= 0.5:
            correct += 1
    return correct*1.0 / diff.shape[0]


correct_rate = test_classifier(X, y, theta)
print('分类器的分类正确率是%.2f%%' % (correct_rate*100))

可以看到我们分类器的正确率为90%。

cost训练曲线和分类结果

实验二

Suppose you are the product manager of the factory and you have the test results for some microchips on two different tests. From these two tests, you would like to determine whether the microchips should be accepted or rejected.

首先做出数据的散点图：

我们可以看出两类数据明显是无法用线性分类器分开的，但是近似可以用一个椭圆来区分，因此我们将$(x_1,x_2,x_1^2,x_2^2)$作为五个特征，这样可以转换非线性分类器为线性分类器。

为了防止过拟合，我这次采用添加了惩罚项的cost function：

def compute_cost(X, y, theta, lam):
    # 默认以e为底数
    punish = np.sum(np.power(theta, 2))*lam/(2*y.shape[0])
    return np.sum(-y*np.log(sigmoid(X, theta)) - (1-y)*np.log(1-sigmoid(X, theta)))/y.shape[0] + punish

可以看出我添加了一项惩罚项，目的是防止$\theta$值过大，造成过拟合，现在的cost function变成了：

$J(\theta)=\frac{1}{2m}\sum\limits_{i=1}^m[(h_\theta(x^{(i)})-y^{(i)})^{2}+\lambda \sum\limits_{j=1}^n\theta _j^2)]$

编写梯度下降函数

由于添加了正则项，在更新参数的过程中每次还要加上$\frac{\lambda}{m}\theta_j$（除了$\theta_0$，因为我们未对其正则化）。

$\theta_j:=\theta_j-\alpha[\frac{1}{m}\sum\limits_{i=1}^{m}(({h_\theta}(x^{(i)})-y^{(i)})x_{j}^{(i)}+\frac{\lambda }{m}\theta_j]$

def gradient_descent(X, y, theta, alpha):
    tmp = np.zeros((1, 5))
    cost = [compute_cost(X, y, theta, lam)]
    ite = 0
    while True:
        ite += 1
        for j in range(theta.shape[1]):
            if not j:
                tmp[0, j] = theta[0, j] * (1 - lam * alpha / y.shape[0]) - alpha * np.sum(
                    (sigmoid(X, theta) - y) * X[:, j].reshape((y.shape[0], 1))) / y.shape[0]
            else:
                tmp[0, j] = theta[0, j]*(1-lam*alpha/y.shape[0]) - alpha*np.sum((sigmoid(X, theta)-y)*X[:, j].reshape((y.shape[0], 1)))/y.shape[0]
        theta = tmp
        cost.append(compute_cost(X, y, theta, lam))
        print('第{}次迭代，theta为{}，cost为{}'.format(ite, theta, cost[-1]))
        if abs(cost[-1] - cost[-2]) <= 0.0001:
            print('共用了%d次迭代收敛' % ite)
            break
    return theta, cost, ite

查看结果，做出cost下降曲线

theta, cost, ite = gradient_descent(X, y, theta, alpha)
final_cost = compute_cost(X, y, theta, lam)
print('final_cost:', final_cost)
print('theta:', theta)

分类器正确率为77.12%。

cost下降曲线：

实验问题

最大的问题集中在数学推导信息熵的导数以及利用经过特征缩放的数据拟合出的曲线的参数和实际参数的关系。后者还未解决。

完整代码

import numpy as np
import matplotlib.pyplot as plt

with open('ex2data1.txt', 'r') as f:
    data = f.readlines()

score1 = []
score2 = []
admit = []
for i in data:
    tmp = i.split(',', 3)
    score1.append(float(tmp[0]))
    score2.append(float(tmp[1]))
    admit.append(int(tmp[2].replace('\n', '')))

# m为样本数量
m = len(admit)


# 绘制散点图
score1_pos = []
score1_neg = []
score2_pos = []
score2_neg = []
for idx, i in enumerate(admit):
    if i:
        score1_pos.append(score1[idx])
        score2_pos.append(score2[idx])
    else:
        score1_neg.append(score1[idx])
        score2_neg.append(score2[idx])

plt.figure()
plt.scatter(score1_pos, score2_pos, label='admitted', c='r', marker='o')
plt.scatter(score1_neg, score2_neg, label='rejected', c='b', marker='x')
plt.legend()

# 构造数据和标签矩阵
X = np.zeros((m, 3))
X[:, 0] = np.ones(m)
X_1 = np.array(score1)
X_1 = (X_1-np.mean(X_1))/np.std(X_1)  # 特征缩放
X_2 = np.array(score2)
X_2 = (X_2-np.mean(X_2))/np.std(X_2)  # 特征缩放
X[:, 1] = X_1
X[:, 2] = X_2
y = np.array(admit).reshape((m, 1))


# 初始化参数
theta = np.zeros((1, 3))
alpha = 0.03


def sigmoid(X, theta):
    """
    sigmoid函数
    :param X: m*3数据矩阵
    :param theta: 1*3参数
    :return: m*1结果
    """
    return 1/(1 + np.exp(-np.dot(X, theta.T)))


def compute_cost(X, y, theta):
    # 默认以e为底数
    return np.sum(-y*np.log(sigmoid(X, theta)) - (1-y)*np.log(1-sigmoid(X, theta)))/y.shape[0]


def gradient_descent(X, y, theta, alpha):
    tmp = np.zeros((1, 3))
    cost = [compute_cost(X, y, theta)]
    ite = 0
    while True:
        ite += 1
        for j in range(theta.shape[1]):
            tmp[0, j] = theta[0, j] - alpha * np.sum((sigmoid(X, theta)-y)*X[:, j].reshape((y.shape[0], 1)))/y.shape[0]
        theta = tmp
        cost.append(compute_cost(X, y, theta))
        print('第{}次迭代，theta为{}，cost为{}'.format(ite, theta, cost[-1]))
        if abs(cost[-1] - cost[-2]) <= 0.0001:
            print('共用了%d次迭代收敛' % ite)
            break
    return theta, cost, ite


theta, cost, ite = gradient_descent(X, y, theta, alpha)
final_cost = compute_cost(X, y, theta)
print('final_cost:', final_cost)
print('theta:', theta)


def test_classifier(X, y, theta):
    correct = 0
    diff = sigmoid(X, theta) - y
    for i in range(diff.shape[0]):
        if abs(diff[i, 0]) <= 0.5:
            correct += 1
    return correct*1.0 / diff.shape[0]


correct_rate = test_classifier(X, y, theta)
print('分类器的分类正确率是%.2f%%' % (correct_rate*100))

# 作出分类直线
xaxis = [i for i in np.linspace(-2, 2, 50)]
yaxis = list(map(lambda x: (-theta[0, 1]*x-theta[0, 0])/theta[0, 2], xaxis))

plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.plot(xaxis, yaxis, c='k')
# plt.show()

# 作出训练损失曲线
plt.figure()
plt.plot([i for i in range(ite)], cost[1: ], c='b')
plt.grid()

plt.show()