机器学习 | 线性算法 —

机器学习 | 线性算法 —— 大禹治水

Machine-Learning: 《机器学习必修课：经典算法与Python实战》配套代码 - Gitee.com

如果说KNN算法体现了人们对空间距离的理解，

那么线性算法则体现了人们对事物趋势上的认识。

注意图中横纵坐标的不同。

线性回归、多项式回归多用于预测，逻辑回归多用于分类。

回归就是找条 “线"。

看这条线本身便是回归任务，看这条线的两边便是分类任务。

一、线性回归

一元线性回归

最优化问题
民主投票 Σ
距离的衡量
一元线性回归的解：

多元线性回归

求解为：

多项式回归 —— 使用变量替换

二、逻辑回归

逻辑回归（Logistic Function）

不光用来解决回归任务，也能解决分类任务。

本质上还是找一条线，只不过关注的不是使数据更好的在这条线上，而是分布在这条线的两边。

通常用于分类问题时，只能解决二分类问题。

sigmod函数可以将线性分布变换为非线性。

则现在的逻辑即给定X和Y，找到合适的w，拟合p

既然是投票，本质还是求距离：

逻辑回归的损失函数即：

多项式逻辑回归 —— 使用变量替换

三、线性回归代码实现

3.1、一元线性回归

import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

boston = datasets.load_boston()

print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of black people by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

x = boston.data[:,5]
y = boston.target

x = x[y<50]
y = y[y<50]

plt.scatter(x,y)
plt.show()

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)

plt.scatter(x_train, y_train)
plt.show()

一元线性回归公式实现

def fit(x, y):
    a_up = np.sum((x-np.mean(x))*(y - np.mean(y)))
    a_bottom = np.sum((x-np.mean(x))**2)
    a = a_up / a_bottom
    b = np.mean(y) - a * np.mean(x)
    return a, b

a, b = fit(x_train, y_train)
a, b

(8.056822140369603, -28.49306872447786)

plt.scatter(x_train, y_train)
plt.plot(x_train, a*x_train+ b, c='r')
plt.show()

plt.scatter(x_test, y_test)
plt.plot(x_test, a*x_test+ b, c='r')
plt.show()

3.2、sklearn实现一元线性回归

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()

lin_reg.fit(x_train.reshape(-1,1), y_train)

LinearRegression

LinearRegression()

y_predict = lin_reg.predict(x_test.reshape(-1,1))

plt.scatter(x_test, y_test)
plt.plot(x_test, y_predict, c='r')
plt.show()

3.3、sklearn 实现多元线性回归

x = boston.data
y = boston.target

x = x[y<50]
y = y[y<50]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)

lin_reg.fit(x_train, y_train)

LinearRegression

LinearRegression()

lin_reg.score(x_test, y_test)

0.7455942658788952

归一化吗？

多元线性回归中不需归一化，这是因为多元线性回归学习的就是每一维特征的权重。

from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
standardScaler.fit(x_train)
x_train = standardScaler.transform(x_train)
x_test = standardScaler.transform(x_test)