Ordinary linear regression treats every training sample equally when computing the total sample error (the loss value). As a result, a small number of "bad" samples can pull the fitted model away from the pattern followed by the majority of good samples, hurting the model's prediction accuracy.
import sklearn.linear_model as lm
Create the model object: model = lm.LinearRegression()
Train the model object: model.fit(x, y)  # [x, y] -> [w0, w1]
Predict outputs for given inputs: pred_y = model.predict(pred_x)
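A minimal sketch of the three-step API above, using made-up data (the values are illustrative, not from the course's single.txt):

```python
import numpy as np
import sklearn.linear_model as lm

x = np.array([[1.0], [2.0], [3.0], [4.0]])  # one row per sample, one feature
y = np.array([3.0, 5.0, 7.0, 9.0])          # exactly y = 2x + 1

model = lm.LinearRegression()               # create the model object
model.fit(x, y)                             # train: learns w0 (intercept) and w1 (slope)
pred_y = model.predict(np.array([[5.0]]))   # predict for a new input
print(model.intercept_, model.coef_[0], pred_y[0])
```

Because the data lie exactly on a line, the fit recovers w0 = 1 and w1 = 2.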
import pickle
import numpy as np
import sklearn.linear_model as lm
import sklearn.metrics as sm
import matplotlib.pyplot as mp
x, y = [], []
with open('../data/single.txt', 'r') as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(',')]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y)
# Create the linear regressor
model = lm.LinearRegression()
# Train the linear regressor
model.fit(x, y)  # finds the optimal model parameters (sklearn solves least squares in closed form)
# Test the linear regressor
pred_y = model.predict(x)
for true, pred in zip(y, pred_y):
    print(true, '->', pred)
# Mean absolute error: mean(|y - y'|)
print(sm.mean_absolute_error(y, pred_y))
# Mean squared error: mean((y - y')^2)
print(sm.mean_squared_error(y, pred_y))
# Median absolute error: median(|y - y'|)
print(sm.median_absolute_error(y, pred_y))
# Explained variance score: best value is 1.0
print(sm.explained_variance_score(y, pred_y))
# R2 score (coefficient of determination): best value is 1.0; can be negative for poor fits
print(sm.r2_score(y, pred_y))
# Save the trained model
with open('../data/linear.pkl', 'wb') as f:
    pickle.dump(model, f)
mp.figure('Linear Regression', facecolor='lightgray')
mp.title('Linear Regression', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.scatter(x, y, c='dodgerblue', alpha=0.75, s=60, label='Sample')
sorted_indices = x.T[0].argsort()
mp.plot(x[sorted_indices], pred_y[sorted_indices], c='orangered', label='Regression')
mp.legend()
mp.show()
import pickle
import numpy as np
import sklearn.linear_model as lm
import sklearn.metrics as sm
import matplotlib.pyplot as mp
x, y = [], []
with open('../data/single.txt', 'r') as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(',')]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y)
# Load the model from file
with open('../data/linear.pkl', 'rb') as f:
    model = pickle.load(f)
# Test the linear regressor
pred_y = model.predict(x)
for true, pred in zip(y, pred_y):
    print(true, '->', pred)
# Mean absolute error: mean(|y - y'|)
print(sm.mean_absolute_error(y, pred_y))
# Mean squared error: mean((y - y')^2)
print(sm.mean_squared_error(y, pred_y))
# Median absolute error: median(|y - y'|)
print(sm.median_absolute_error(y, pred_y))
# Explained variance score: best value is 1.0
print(sm.explained_variance_score(y, pred_y))
# R2 score (coefficient of determination): best value is 1.0; can be negative for poor fits
print(sm.r2_score(y, pred_y))
mp.figure('Linear Regression', facecolor='lightgray')
mp.title('Linear Regression', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.scatter(x, y, c='dodgerblue', alpha=0.75, s=60, label='Sample')
sorted_indices = x.T[0].argsort()
mp.plot(x[sorted_indices], pred_y[sorted_indices], c='orangered', label='Regression')
mp.legend()
mp.show()
Ridge regression builds on linear regression by adding an L2 penalty on the model coefficients to the loss. Shrinking the coefficients keeps the model from contorting itself to fit the few outlying samples, so the fitted model stays close to the general pattern followed by the majority of good samples, and the influence of the few bad samples on the model is weakened.
The regularization strength is a hyperparameter, set by hand:
model = lm.Ridge(regularization strength / penalty)
Regularization strength / penalty: [0, +oo)
The smaller the strength, the weaker the penalty on the coefficients; 0 means no penalty, which is equivalent to ordinary linear regression.
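A small sketch of the effect of the regularization strength, on made-up noisy data (the values and seed are illustrative): as alpha grows, Ridge shrinks the coefficient toward zero, and alpha=0 reproduces ordinary LinearRegression.

```python
import numpy as np
import sklearn.linear_model as lm

rng = np.random.default_rng(7)
x = rng.uniform(0.0, 10.0, size=(50, 1))
y = 2.0 * x.ravel() + 1.0 + rng.normal(0.0, 0.5, size=50)  # roughly y = 2x + 1

# Ordinary least-squares slope for comparison
lin_coef = lm.LinearRegression().fit(x, y).coef_[0]

# Fit Ridge at several regularization strengths
coef = {a: lm.Ridge(alpha=a).fit(x, y).coef_[0]
        for a in (0.0, 1.0, 100.0, 10000.0)}
for a in sorted(coef):
    print(a, coef[a])  # slope shrinks as alpha grows
```

With alpha=0.0 the Ridge slope matches the LinearRegression slope; at alpha=10000 the slope is pulled strongly toward zero.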
import numpy as np
import sklearn.linear_model as lm
import sklearn.metrics as sm
import matplotlib.pyplot as mp
x, y = [], []
with open('../data/abnormal.txt', 'r') as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(',')]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y)
# Create the linear regressor
model1 = lm.LinearRegression()
# Train the linear regressor
model1.fit(x, y)  # finds the optimal model parameters (sklearn solves least squares in closed form)
# Test the linear regressor
pred_y1 = model1.predict(x)
# R2 score of linear regression
print(sm.r2_score(y, pred_y1))
# Create the ridge regressor
model2 = lm.Ridge(250)
# Train the ridge regressor
model2.fit(x, y)  # the coefficient penalty weakens the influence of abnormal samples
# Test the ridge regressor
pred_y2 = model2.predict(x)
# R2 score of ridge regression
print(sm.r2_score(y, pred_y2))
mp.figure('Linear & Ridge Regression', facecolor='lightgray')
mp.title('Linear & Ridge Regression', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.scatter(x, y, c='dodgerblue', alpha=0.75, s=60, label='Sample')
sorted_indices = x.T[0].argsort()
mp.plot(x[sorted_indices], pred_y1[sorted_indices], c='orangered', label='Linear')
mp.plot(x[sorted_indices], pred_y2[sorted_indices], c='limegreen', label='Ridge')
mp.legend()
mp.show()
Polynomial regression fits a single input x with a polynomial model:

$y = w_0 + w_1x + w_2x^2 + w_3x^3 + \dots + w_nx^n$

$loss = Loss(w_0, w_1, \dots, w_n)$

Treating each power of x as a separate feature,

$x \rightarrow x_1 = x,\ x_2 = x^2,\ x_3 = x^3,\ \dots,\ x_n = x^n$

turns the problem into ordinary multivariate linear regression:

$y = w_0 + w_1x_1 + w_2x_2 + w_3x_3 + \dots + w_nx_n$
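The feature expansion above can be sketched with sklearn's PolynomialFeatures feeding a plain linear regressor (the data here is a made-up noiseless quadratic, for illustration only):

```python
import numpy as np
import sklearn.pipeline as pl
import sklearn.preprocessing as sp
import sklearn.linear_model as lm

x = np.linspace(-2.0, 2.0, 40).reshape(-1, 1)
y = 1.0 + 2.0 * x.ravel() - 3.0 * x.ravel() ** 2   # a known quadratic, no noise

model = pl.make_pipeline(
    sp.PolynomialFeatures(degree=2),   # expand x -> 1, x, x^2
    lm.LinearRegression())             # then fit w0, w1, w2 linearly
model.fit(x, y)
pred = model.predict(np.array([[1.0]]))[0]
print(pred)  # close to 1 + 2 - 3 = 0
```

Since the expanded problem is plain linear regression, the quadratic is recovered exactly.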