机器学习-基础分类算法-KNN详解

KNN-k近邻算法

k-Nearest Neighbors

  • 思想极度简单
  • 应用数学只是少
  • 效果好
  • 可以解释机器学习算法使用过程中的很多细节问题
  • 更完整的刻画机器学习应用的流程

image.png
image.png
image.png
创建简单测试用例

import numpy as np
import matplotlib.pyplot as plt
raw_data_X = [[3.393533211, 2.331273381],
              [3.110073483, 1.781539638],
              [1.343808831, 3.368360954],
              [3.582294042, 4.679179110],
              [2.280362439, 2.866990263],
              [7.423436942, 4.696522875],
              [5.745051997, 3.533989803],
              [9.172168622, 2.511101045],
              [7.792783481, 3.424088941],
              [7.939820817, 0.791637231]
             ]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
X_train = np.array(raw_data_X)
y_train = np.array(raw_data_y)
plt.scatter(X_train[y_train==0,0], X_train[y_train==0,1], color='g')
plt.scatter(X_train[y_train==1,0], X_train[y_train==1,1], color='r')
plt.show()

无标题.png
预测

x = np.array([8.093607318, 3.365731514])

plt.scatter(X_train[y_train==0,0], X_train[y_train==0,1], color='g')
plt.scatter(X_train[y_train==1,0], X_train[y_train==1,1], color='r')
plt.scatter(x[0], x[1], color='b')
plt.show()

无标题.png
KNN过程

from math import sqrt
distances = []
for x_train in X_train:
    d = sqrt(np.sum((x_train - x)**2))
    distances.append(d)

image.png
image.png

distances = [sqrt(np.sum((x_train - x)**2))
             for x_train in X_train]

数组排序返回索引

np.argsort(distances)
nearest = np.argsort(distances)

最近k个点相近的y坐标

k = 6
topK_y = [y_train[neighbor] for neighbor in nearest[:k]]
topK_y

image.png
统计

from collections import Counter
votes = Counter(topK_y)
votes.most_common(1)
predict_y = votes.most_common(1)[0][0]

image.png

封装

import numpy as np
from math import sqrt
from collections import Counter


def kNN_classify(k, X_train, y_train, x):

    assert 1 <= k <= X_train.shape[0], "k must be valid"
    assert X_train.shape[0] == y_train.shape[0], \
        "the size of X_train must equal to the size of y_train"
    assert X_train.shape[1] == x.shape[0], \
        "the feature number of x must be equal to X_train"

    distances = [sqrt(np.sum((x_train - x)**2)) for x_train in X_train]
    nearest = np.argsort(distances)

    topK_y = [y_train[i] for i in nearest[:k]]
    votes = Counter(topK_y)

    return votes.most_common(1)[0][0]

  • k近邻算法是非常特殊的,可以被认为是没有模型的算法
  • 为了和其他算法统一,可以认为训练数据集就是模型本身

使用scikit-learn中的kNN

raw_data_X = [[3.393533211, 2.331273381],
              [3.110073483, 1.781539638],
              [1.343808831, 3.368360954],
              [3.582294042, 4.679179110],
              [2.280362439, 2.866990263],
              [7.423436942, 4.696522875],
              [5.745051997, 3.533989803],
              [9.172168622, 2.511101045],
              [7.792783481, 3.424088941],
              [7.939820817, 0.791637231]
             ]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

X_train = np.array(raw_data_X)
y_train = np.array(raw_data_y)

x = np.array([8.093607318, 3.365731514])
from sklearn.neighbors import KNeighborsClassifier
kNN_classifier = KNeighborsClassifier(n_neighbors=6)
kNN_classifier.fit(X_train, y_train)
#预测
kNN_classifier.predict(x)
#转化为矩阵
X_predict = x.reshape(1, -1)
kNN_classifier.predict(X_predict)
y_predict = kNN_classifier.predict(X_predict)

image.png
优化封装的KNN

import numpy as np
from math import sqrt
from collections import Counter


class KNNClassifier:

    def __init__(self, k):
        """初始化kNN分类器"""
        assert k >= 1, "k must be valid"
        self.k = k
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        """根据训练数据集X_train和y_train训练kNN分类器"""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"
        assert self.k <= X_train.shape[0], \
            "the size of X_train must be at least k."

        self._X_train = X_train
        self._y_train = y_train
        return self

    def predict(self, X_predict):
        """给定待预测数据集X_predict,返回表示X_predict的结果向量"""
        assert self._X_train is not None and self._y_train is not None, \
                "must fit before predict!"
        assert X_predict.shape[1] == self._X_train.shape[1], \
                "the feature number of X_predict must be equal to X_train"

        y_predict = [self._predict(x) for x in X_predict]
        return np.array(y_predict)

    def _predict(self, x):
        """给定单个待预测数据x,返回x的预测结果值"""
        assert x.shape[0] == self._X_train.shape[1], \
            "the feature number of x must be equal to X_train"

        distances = [sqrt(np.sum((x_train - x) ** 2))
                     for x_train in self._X_train]
        nearest = np.argsort(distances)

        topK_y = [self._y_train[i] for i in nearest[:self.k]]
        votes = Counter(topK_y)

        return votes.most_common(1)[0][0]

    def __repr__(self):
        return "KNN(k=%d)" % self.k

image.png

判断机器学习算法的性能-训练数据分割测试数据

image.png
分一部分数据设置为测试数据
image.png
train test split
加载数据

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets 
iris = datasets.load_iris()
iris.keys()
X = iris.data
y = iris.target
X.shape
y.shape

image.png
**分离出一部分数据做训练,另外一部分数据做测试 **
将索引随机排列

shuffled_indexes = np.random.permutation(len(X))
shuffled_indexes

设置测试数据百分比
获取测试数据

test_ratio = 0.2
test_size = int(len(X) * test_ratio)
test_indexes = shuffled_indexes[:test_size]
train_indexes = shuffled_indexes[test_size:]
X_train = X[train_indexes]
y_train = y[train_indexes]
X_test = X[test_indexes]
y_test = y[test_indexes]

image.png
封装算法

import numpy as np
def train_test_split(X, y, test_ratio=0.2, seed=None):
    """将数据 X 和 y 按照test_ratio分割成X_train, X_test, y_train, y_test"""
    assert X.shape[0] == y.shape[0], \
        "the size of X must be equal to the size of y"
    assert 0.0 <= test_ratio <= 1.0, \
        "test_ration must be valid"

    if seed:
        np.random.seed(seed)

    shuffled_indexes = np.random.permutation(len(X))

    test_size = int(len(X) * test_ratio)
    test_indexes = shuffled_indexes[:test_size]
    train_indexes = shuffled_indexes[test_size:]

    X_train = X[train_indexes]
    y_train = y[train_indexes]

    X_test = X[test_indexes]
    y_test = y[test_indexes]

    return X_train, X_test, y_train, y_test

image.png
sklearn中的train_test_split
image.png

分类准确度

加载手写数字数据

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
digits = datasets.load_digits()
digits.keys()
X = digits.data
X.shape
y = digits.target
y.shape

image.png
查看数据

some_digit = X[666]
some_digit_image = some_digit.reshape(8, 8)
import matplotlib
import matplotlib.pyplot as plt
plt.imshow(some_digit_image, cmap = matplotlib.cm.binary)
plt.show()

image.png
预测准确率

from playML.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_ratio=0.2)
from playML.kNN import KNNClassifier

my_knn_clf = KNNClassifier(k=3)
my_knn_clf.fit(X_train, y_train)
y_predict = my_knn_clf.predict(X_test)
sum(y_predict == y_test) / len(y_test)

image.png
封装自己的accuracy_score

import numpy as np


def accuracy_score(y_true, y_predict):
    '''计算y_true和y_predict之间的准确率'''
    assert y_true.shape[0] == y_predict.shape[0], \
        "the size of y_true must be equal to the size of y_predict"

    return sum(y_true == y_predict) / len(y_true)

image.png
scikit-learn中的accuracy_score

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
y_predict = knn_clf.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predict)

超参数和模型参数

  • 超参数:在算法运行前需要决定的参数
  • 模型参数:算法过程中学习的参数

KNN算法没有模型参数
KNN算法中的K是典型的超参数
寻找好的超参数

  • 领域知识
  • 经验数值
  • 实验搜索

加载手写数字数据集

import numpy as np
from sklearn import datasets
digits = datasets.load_digits()
X = digits.data
y = digits.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
knn_clf.score(X_test, y_test)

寻找最好的k、

best_score = 0.0
best_k = -1
for k in range(1, 11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_k = k
        best_score = score
        
print("best_k =", best_k)
print("best_score =", best_score)

image.png
image.png
image.png
加入是否考虑距离

best_score = 0.0
best_k = -1
best_method = ""
for method in ["uniform", "distance"]:
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_method = method
        
print("best_method =", best_method)
print("best_k =", best_k)
print("best_score =", best_score)

明可夫斯基距离
image.png

best_score = 0.0
best_k = -1
best_p = -1

for k in range(1, 11):
    for p in range(1, 6):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights="distance", p=p)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_p = p
            best_score = score
        
print("best_k =", best_k)
print("best_p =", best_p)
print("best_score =", best_score)

网格搜索 Grid Search

定义搜索参数

param_grid = [
    {
        'weights': ['uniform'], 
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)], 
        'p': [i for i in range(1, 6)]
    }
]
knn_clf = KNeighborsClassifier()
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(knn_clf, param_grid)
grid_search.fit(X_train, y_train)

查看最佳分类器参数和准确度

grid_search.best_estimator_
grid_search.best_score_

image.png

更多的距离定义

  • 向量空间余弦相似度 Cosine Similarity
  • 调整余弦相似度 Adjusted Cosine Similarity
  • 皮尔森相关系数 Pearson Correlation Coefficient
  • Jaccard相似吸收 Jaccard Coeffcient

数据归一化 Feature Scaling

将所有的数据映射到同一尺度
最值归一化 normalization:把所有数据映射到0-1之间
image.png
适用于分布有明显边界的情况;受outlier影响较大

x = np.random.randint(0, 100, 100) 
(x - np.min(x)) / (np.max(x) - np.min(x))

image.png

X = np.random.randint(0, 100, (50, 2))
X = np.array(X, dtype=float)
X[:,0] = (X[:,0] - np.min(X[:,0])) / (np.max(X[:,0]) - np.min(X[:,0]))
X[:,1] = (X[:,1] - np.min(X[:,1])) / (np.max(X[:,1]) - np.min(X[:,1]))
X[:10,:]

image.png
均值方差归一化 standardization
数据分布没有明显的边界,有可能存在极端数据值
均值方差归一化:把所有数据归一到均值为0方差为1的分布中
image.png

X2 = np.random.randint(0, 100, (50, 2))
X2 = np.array(X2, dtype=float)
X2[:,0] = (X2[:,0] - np.mean(X2[:,0])) / np.std(X2[:,0])
X2[:,1] = (X2[:,1] - np.mean(X2[:,1])) / np.std(X2[:,1])

image.png

测试数据集的归一化

测试数据说模拟真实环境

  • 真实环境很有可能无法得到所有测试数据的均值和方差
  • 对数据的归一化也是算法的一部分

(x_test-mean_train)/std_train

scikit-learn中的Scaler

image.png
加载数据

import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target

数据分割

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=666)

数据归一化

from sklearn.preprocessing import StandardScaler 
standardScalar = StandardScaler() 
standardScalar.fit(X_train)
standardScalar.transform(X_train)#归一化

image.png
归一化结果赋值

X_train = standardScalar.transform(X_train)
X_test_standard = standardScalar.transform(X_test) 

使用归一化后的数据进行knn分类

from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
knn_clf.score(X_test_standard, y_test)

实现自己的standardScaler

import numpy as np


class StandardScaler:

    def __init__(self):
        self.mean_ = None
        self.scale_ = None

    def fit(self, X):
        """根据训练数据集X获得数据的均值和方差"""
        assert X.ndim == 2, "The dimension of X must be 2"

        self.mean_ = np.array([np.mean(X[:,i]) for i in range(X.shape[1])])
        self.scale_ = np.array([np.std(X[:,i]) for i in range(X.shape[1])])

        return self

    def transform(self, X):
        """将X根据这个StandardScaler进行均值方差归一化处理"""
        assert X.ndim == 2, "The dimension of X must be 2"
        assert self.mean_ is not None and self.scale_ is not None, \
               "must fit before transform!"
        assert X.shape[1] == len(self.mean_), \
               "the feature number of X must be equal to mean_ and std_"

        resX = np.empty(shape=X.shape, dtype=float)
        for col in range(X.shape[1]):
            resX[:,col] = (X[:,col] - self.mean_[col]) / self.scale_[col]
        return resX

有关k近邻算法

解决分类问题
天然可以解决多分类问题
思想简单,效果强大
最大缺点:效率低下
如果训练集有m个样本,n个特征,则预测每一个新的数据,需要O(m*n)
优化 使用树结构:KD-Tree,Ball-Tree
缺点2:高度数据相关
缺点3:预测结果不具有可解释性
维数灾难
随着维度的增加,“看似相近”的两个点之间的距离越来越大
解决方法:降维、

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mfbz.cn/a/368804.html

如若内容造成侵权/违法违规/事实不符,请联系我们进行投诉反馈qq邮箱809451989@qq.com,一经查实,立即删除!

相关文章

Flutter实现轮播图功能

一、在pubspec.yaml中添加&#xff1a; dependencies:# 轮播图card_swiper: ^3.0.1card_swiper: ^3.0.1&#xff0c;要获取最新版本&#xff1a;https://pub-web.flutter-io.cn/packages/card_swiper/versions&#xff0c;这个里面有文档可以看&#xff0c;如下图&#xff1a;…

大模型ReAct智能体开发实战

哆啦A梦是很多人都熟悉的角色&#xff0c;包括我自己。 在成长过程中&#xff0c;我常常对他口袋里的许多小玩意感到惊讶&#xff0c;而且他知道何时使用它们。 随着大型语言模型 (LLM) 的发展趋势&#xff0c;你也可以构建一个具有相同行为方式的模型&#xff01; 我们将构建…

高中数学立体几何练习题3

用到的基础知识&#xff1a; 1. 2.

MATLAB计算多边形质心/矩心

前言&#xff1a;不规则四边形的中心 不规则四边形的出心有多种定义&#xff0c;以下是最常见的三种&#xff1a; 1.重心&#xff1a;重心是四边形内部所有顶点连线交点的平均位置。可以通过求解四个顶点坐标的平均值来找到重心。 2.质心&#xff1a;质心是四边形内部所有质点…

Python机器学习库(numpy库)

文章目录 Python机器学习库&#xff08;numpy库&#xff09;1. 数据的维度2. numpy基础知识2.1 numpy概述2.1 numpy概述2.1 numpy概述2.2 numpy库的引用 3. ndarray数组的创建3.1 N维数组对象ndarray3.2 创建ndarray数组3.2.1 使用Python列表、元组创建ndarray数组3.2.2 使用nu…

029 命令行传递参数

1.循环输出args字符串数组 public class D001 {public static void main(String[] args) {for (String arg : args) {System.out.println(arg);}} } 2.找打这个类的路径&#xff0c;打开cmd cmd C:\Users\Admin\IdeaProjects\JavaSE学习之路\scanner\src\com\yxm\demo 3. 编译…

Servlet+Ajax实现对数据的列表展示(极简入门)

目录 1.准备工作 1.数据库源&#xff08;这里以Mysql为例&#xff09; 2.映射实体类 3.模拟三层架构&#xff08;Dao、Service、Controller&#xff09; Dao接口 Dao实现 Service实现&#xff08;这里省略Service接口&#xff09; Controller层&#xff08;或叫Servlet层…

2024济南生物发酵展:会议日程安排和技术装备亮点预告

2024济南发酵展/2024生物发酵展/2024山东发酵展/2024济南生物制药展/2024生物技术展/2024食品设备展/2024食品加工展/2024济南细胞工程展 由中国生物发酵产业协会主办&#xff0c;上海信世展览服务有限公司承办的2024第12届国际生物发酵产品与技术装备展览会&#xff08;济南&a…

深入理解Istio服务网格数据平面Envoy

一、服务网格概述(service mesh) 在传统的微服务架构中&#xff0c;服务间的调用&#xff0c;业务代码需要考虑认证、熔断、服务发现等非业务能力&#xff0c;在某种程度上&#xff0c;表现出了一定的耦合性 服务网格追求高级别的服务流量治理能力&#xff0c;认证、熔断、服…

2023.12 淘天-数科 已offer

文章目录 岗位信息1面ld 12.17 1H2面 VP 12.18 40min3面 HR 12.2012.21offer薪资方案沟通 岗位信息 1面ld 12.17 1H &#xff08;是一个从业估计很长时间前辈&#xff0c;很平和&#xff0c;感觉能学到很多东西&#xff09; 自我介绍项目深究1.说下自己工作里最有成就感的事和…

图论练习3

内容&#xff1a;过程中视条件改变边权&#xff0c;利用树状数组区间加处理 卯酉东海道 题目链接 题目大意 个点&#xff0c;条有向边&#xff0c;每条边有颜色和费用总共有种颜色若当前颜色与要走的边颜色相同&#xff0c;则花费为若当前颜色与要走的边颜色不同&#xff0c;…

MYSQL——MySQL8.3无法启动

在新电脑上装了个MySQL&#xff0c;但是无法使用net start mysql启动&#xff0c;很是纳闷&#xff0c;使用mysqld --console去查看报错&#xff0c;也是没报错的&#xff0c;但是奇怪的是&#xff0c;我输入完这个mysqld --console之后&#xff0c;就等于启动了mysql了&#x…

第十一篇【传奇开心果系列】Python的OpenCV技术点案例示例:三维重建

传奇开心果短博文系列 系列短博文目录Python的OpenCV技术点案例示例系列短博文目录一、前言二、OpenCV三维重建介绍三、基于区域的SGBM示例代码四、BM(Block Matching)算法介绍和示例代码五、基于能量最小化的GC(Graph Cut)算法介绍和示例代码六、相机标定介绍和示例代码七…

【数据结构与算法】之排序系列-20240203

这里写目录标题 一、628. 三个数的最大乘积二、645. 错误的集合三、747. 至少是其他数字两倍的最大数四、905. 按奇偶排序数组五、922. 按奇偶排序数组 II六、976. 三角形的最大周长 一、628. 三个数的最大乘积 简单 给你一个整型数组 nums &#xff0c;在数组中找出由三个数组…

Leetcode刷题笔记题解(C++):36. 有效的数独

思路一&#xff1a;暴力破解&#xff0c;两个二维数组记录行、列对应的数字出现的次数&#xff0c;比如rows[i][index]表示的数字index在i行出现的次数&#xff0c;三维数组记录每个块中对应数字出现的次数&#xff0c;比如boxes[i/3][j/3][index]表示的数字index在[i/3][j/3]个…

Hugging Face推出自定义AI聊天Assistants;谷歌推出图像生成工具 ImageFX

&#x1f989; AI新闻 &#x1f680; 谷歌推出图像生成工具 ImageFX 摘要&#xff1a;谷歌在 Imagen 2 的基础上推出新的图像生成工具 ImageFX&#xff0c;通过简单的文字提示可以生成高质量图像。该工具包含了提示界面&#xff0c;让用户可以快速尝试创作和想法的相邻维度。…

数据结构—基础知识:哈夫曼树

文章目录 数据结构—基础知识&#xff1a;哈夫曼树哈夫曼树的基本概念哈夫曼树的构造算法哈夫曼树的构造过程哈夫曼算法的实现算法&#xff1a;构造哈夫曼树 数据结构—基础知识&#xff1a;哈夫曼树 哈夫曼树的基本概念 哈夫曼&#xff08;Huffman&#xff09;树又称最优树&…

通过 ChatGPT 的 Function Call 查询数据库

用 Function Calling 的方式实现手机流量包智能客服的例子。 def get_sql_completion(messages, model"gpt-3.5-turbo"):response client.chat.completions.create(modelmodel,messagesmessages,temperature0,tools[{ # 摘自 OpenAI 官方示例 https://github.com/…

ASP.NET Core 自定义解压缩提供程序

写在前面 在了解ASP.NET Core 自定义请求解压缩中间件的应用时&#xff0c;依据官方文档操作下来碰到了几个问题&#xff0c;这边做个记录。 关键点就是配置 Content-Encoding&#xff0c;参数需要和代码中添加的提供程序的Key保持一致&#xff1b; builder.Services.AddRequ…

问题:媒体查询语法中, 可用设备名参数表示“文档打印或预览“的是 #媒体#媒体#其他

问题&#xff1a;媒体查询语法中, 可用设备名参数表示"文档打印或预览"的是 A、C.?screen B.?projection C、A.?print D.?speech 参考答案如图所示