Regression: Offshore Wind Power Output Prediction

https://www.dcic-china.com/competitions/10098

Let's walk through how the feature engineering is done.


  1. Time features: hour, minute, day of week, day of month. These act as markers for periodic patterns. For example, foot traffic fluctuates sharply on weekends; if you don't tell the model this explicitly, it has a hard time learning it.
  2. Domain features: these require reading up on the relevant domain knowledge. The operations are basically applying a transform f(x) to a single feature, or arithmetic between two features: add/subtract features from the same domain, multiply/divide features from different domains. Ideally, each engineered feature has a real physical meaning.
  3. Historical sequence features: sliding windows, moving averages, and so on. In a previous competition I saw someone generate these features in an explosive, exhaustive fashion, which surprised me, but their results were genuinely good. This part is somewhat hit-or-miss, so experiment.
  4. Label handling. For regression, if you can reduce the scale of the label, do it. Dividing (or subtracting) by a strongly correlated feature shrinks the variation and keeps the model's predictions from running out of control.
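As a concrete illustration of point 1, hour-of-day can be encoded cyclically with sin/cos so that 23:00 and 00:00 end up close together on the feature scale. This is a minimal sketch on a hypothetical toy frame, not part of the competition pipeline below:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a timestamp column
toy = pd.DataFrame({'time': pd.date_range('2024-01-01', periods=4, freq='6h')})
hour = toy['time'].dt.hour

# Map hour 0..23 onto the unit circle so 23 and 0 become neighbours
toy['hour_sin'] = np.sin(2 * np.pi * hour / 24)
toy['hour_cos'] = np.cos(2 * np.pi * hour / 24)
```

Tree models can often recover periodicity from the raw hour alone, so whether the cyclical version helps is worth validating rather than assuming.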

import numpy as np
import pandas as pd
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostClassifier, CatBoostRegressor
from sklearn.model_selection import StratifiedKFold, KFold, GroupKFold
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt
import tqdm
import sys
import os
import gc
import argparse
import warnings
warnings.filterwarnings('ignore')

# Load the data
train_info = pd.read_csv('../data/first_data/A榜-训练集_海上风电预测_基本信息.csv', encoding='gbk')
train_df = pd.read_csv('../data/first_data/A榜-训练集_海上风电预测_气象变量及实际功率数据.csv', encoding='gbk')

test_info = pd.read_csv('../data/first_data/B榜-测试集_海上风电预测_基本信息.csv', encoding='gbk')
test_df = pd.read_csv('../data/first_data/B榜-测试集_海上风电预测_气象变量数据.csv', encoding='gbk')

submit_example = pd.read_csv('../data/first_data/submit_example.csv')

train_df = train_df.merge(train_info[['站点编号','装机容量(MW)']], on=['站点编号'], how='left')
test_df = test_df.merge(test_info[['站点编号','装机容量(MW)']], on=['站点编号'], how='left')

train_df['站点编号'] = train_df['站点编号'].apply(lambda x:int(x[1]))
test_df['站点编号'] = test_df['站点编号'].apply(lambda x:int(x[1]))

train_df.columns = ['stationId','time','airPressure','relativeHumidity','cloudiness','10mWindSpeed','10mWindDirection',
                 'temperature','irradiation','precipitation','100mWindSpeed','100mWindDirection','power','capacity']

test_df.columns = ['stationId','time','airPressure','relativeHumidity','cloudiness','10mWindSpeed','10mWindDirection',
                 'temperature','irradiation','precipitation','100mWindSpeed','100mWindDirection','capacity']

# Feature crosses
train_df['100mWindSpeed/10mWindSpeed'] = train_df['100mWindSpeed'] / (train_df['10mWindSpeed'] + 0.0000001)
test_df['100mWindSpeed/10mWindSpeed'] = test_df['100mWindSpeed'] / (test_df['10mWindSpeed'] + 0.0000001)

train_df['100mWindDirection/10mWindDirection'] = train_df['100mWindDirection'] / (train_df['10mWindDirection'] + 0.0000001)
test_df['100mWindDirection/10mWindDirection'] = test_df['100mWindDirection'] / (test_df['10mWindDirection'] + 0.0000001)

train_df['10mWindDirection_new'] = train_df['10mWindDirection'] - 180
test_df['10mWindDirection_new'] = test_df['10mWindDirection'] - 180
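One caveat with the direction features above: wind direction is a circular quantity, so subtracting 180 or taking ratios of angles has no clean physical meaning. A common alternative (an assumption on my part, not what this solution does) is to decompose speed and direction into u/v vector components; `wind_to_uv`, `spd`, and `dir` here are hypothetical names:

```python
import numpy as np
import pandas as pd

def wind_to_uv(df, speed_col, dir_col, prefix):
    """Decompose wind speed + direction (degrees) into u/v vector components."""
    rad = np.deg2rad(df[dir_col])
    df[prefix + '_u'] = df[speed_col] * np.sin(rad)
    df[prefix + '_v'] = df[speed_col] * np.cos(rad)
    return df

# Hypothetical example: 10 m/s wind at 90 degrees
demo = pd.DataFrame({'spd': [10.0], 'dir': [90.0]})
demo = wind_to_uv(demo, 'spd', 'dir', 'w10')
```

The u/v components are continuous across the 0/360 boundary, which the raw angle is not.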

# Differences
train_df['100mWindSpeed_10mWindSpeed'] = train_df['100mWindSpeed'] - train_df['10mWindSpeed'] 
test_df['100mWindSpeed_10mWindSpeed'] = test_df['100mWindSpeed'] - test_df['10mWindSpeed']

train_df['100mWindDirection_10mWindDirection'] = train_df['100mWindDirection'] - train_df['10mWindDirection']
test_df['100mWindDirection_10mWindDirection'] = test_df['100mWindDirection'] - test_df['10mWindDirection']

# Ratio of the speed ratio to the direction ratio
train_df['WindSpeed/WindDirection'] = train_df['100mWindSpeed/10mWindSpeed'] / train_df['100mWindDirection/10mWindDirection']
test_df['WindSpeed/WindDirection'] = test_df['100mWindSpeed/10mWindSpeed'] / test_df['100mWindDirection/10mWindDirection']

# Wind shear index: log of the 100m/10m speed ratio (scaled)
train_df['100mWindSpeed/10mWindSpeed_2'] = np.log10(train_df['100mWindSpeed/10mWindSpeed']) / 10
test_df['100mWindSpeed/10mWindSpeed_2'] = np.log10(test_df['100mWindSpeed/10mWindSpeed']) / 10

# Humidity / temperature
train_df['relativeHumidity/temperature'] = train_df['relativeHumidity'] / (train_df['temperature'] + 0.0000001)
test_df['relativeHumidity/temperature'] = test_df['relativeHumidity'] / (test_df['temperature'] + 0.0000001)

# Irradiation / temperature
train_df['irradiation/temperature'] = train_df['irradiation'] / (train_df['temperature'] + 0.0000001)
test_df['irradiation/temperature'] = test_df['irradiation'] / (test_df['temperature'] + 0.0000001)

# Irradiation / cloudiness
train_df['irradiation/cloudiness'] = train_df['irradiation'] / (train_df['cloudiness'] + 0.0000001)
test_df['irradiation/cloudiness'] = test_df['irradiation'] / (test_df['cloudiness'] + 0.0000001)

# Precipitation flag
train_df['is_precipitation'] = train_df['precipitation'].apply(lambda x:1 if x>0 else 0)
test_df['is_precipitation'] = test_df['precipitation'].apply(lambda x:1 if x>0 else 0)

def get_time_feature(df, col):
    
    df_copy = df.copy()
    prefix = col + "_"
    df_copy[col] = df_copy[col].astype(str)
    
    df_copy[col] = pd.to_datetime(df_copy[col], format='%Y-%m-%d %H:%M')
    df_copy[prefix + 'month'] = df_copy[col].dt.month
    df_copy[prefix + 'day'] = df_copy[col].dt.day
    df_copy[prefix + 'hour'] = df_copy[col].dt.hour
    df_copy[prefix + 'minute'] = df_copy[col].dt.minute
    # Series.dt.weekofyear was removed in recent pandas; use isocalendar().week
    df_copy[prefix + 'weekofyear'] = df_copy[col].dt.isocalendar().week.astype(int)
    df_copy[prefix + 'dayofyear'] = df_copy[col].dt.dayofyear
    
    return df_copy   

train_df = get_time_feature(train_df, 'time')
test_df = get_time_feature(test_df, 'time')


# Concatenate train and test data
train_df['is_test'] = 0
test_df['is_test'] = 1
df = pd.concat([train_df, test_df], axis=0).reset_index(drop=True)
# Sort by station and time so the positional .values assignments below stay aligned
df = df.sort_values(['stationId','time']).reset_index(drop=True)

# Build features
num_cols = ['airPressure','relativeHumidity','cloudiness','10mWindSpeed','10mWindDirection',
            'temperature','irradiation','precipitation','100mWindSpeed','100mWindDirection']

for col in tqdm.tqdm(num_cols):
    # Historical/future shift and difference features
    for i in [1,2,3,4,5,6,7,15,30,50] + [1*96,2*96,3*96,4*96,5*96]:
        df[f'{col}_shift{i}'] = df.groupby('stationId')[col].shift(i)
        df[f'{col}_future_shift{i}'] = df.groupby('stationId')[col].shift(-i)

        df[f'{col}_diff{i}'] = df[f'{col}_shift{i}'] - df[col]
        df[f'{col}_future_diff{i}'] = df[f'{col}_future_shift{i}'] - df[col]

        df[f'{col}_2diff{i}'] = df.groupby('stationId')[f'{col}_diff{i}'].diff(1)
        df[f'{col}_future_2diff{i}'] = df.groupby('stationId')[f'{col}_future_diff{i}'].diff(1)
    
    # Centered means of growing width, built incrementally
    df[f'{col}_3mean'] = (df[f'{col}'] + df[f'{col}_future_shift1'] + df[f'{col}_shift1'])/3
    df[f'{col}_5mean'] = (df[f'{col}_3mean']*3 + df[f'{col}_future_shift2'] + df[f'{col}_shift2'])/5
    df[f'{col}_7mean'] = (df[f'{col}_5mean']*5 + df[f'{col}_future_shift3'] + df[f'{col}_shift3'])/7
    df[f'{col}_9mean'] = (df[f'{col}_7mean']*7 + df[f'{col}_future_shift4'] + df[f'{col}_shift4'])/9
    df[f'{col}_11mean'] = (df[f'{col}_9mean']*9 + df[f'{col}_future_shift5'] + df[f'{col}_shift5'])/11
    
    df[f'{col}_shift_3_96_mean'] = (df[f'{col}_shift{1*96}'] + df[f'{col}_shift{2*96}'] + df[f'{col}_shift{3*96}'])/3
    df[f'{col}_shift_5_96_mean'] = (df[f'{col}_shift_3_96_mean']*3 + df[f'{col}_shift{4*96}'] + df[f'{col}_shift{5*96}'])/5
    df[f'{col}_future_shift_3_96_mean'] = (df[f'{col}_future_shift{1*96}'] + df[f'{col}_future_shift{2*96}'] + df[f'{col}_future_shift{3*96}'])/3
    df[f'{col}_future_shift_5_96_mean'] = (df[f'{col}_future_shift_3_96_mean']*3 + df[f'{col}_future_shift{4*96}'] + df[f'{col}_future_shift{5*96}'])/5
    
    # Rolling-window statistics (closed='left' excludes the current row)
    for win in [3,5,7,14,28]:
        df[f'{col}_win{win}_mean'] = df.groupby('stationId')[col].rolling(window=win, min_periods=3, closed='left').mean().values
        df[f'{col}_win{win}_max'] = df.groupby('stationId')[col].rolling(window=win, min_periods=3, closed='left').max().values
        df[f'{col}_win{win}_min'] = df.groupby('stationId')[col].rolling(window=win, min_periods=3, closed='left').min().values
        df[f'{col}_win{win}_std'] = df.groupby('stationId')[col].rolling(window=win, min_periods=3, closed='left').std().values
        df[f'{col}_win{win}_skew'] = df.groupby('stationId')[col].rolling(window=win, min_periods=3, closed='left').skew().values
        df[f'{col}_win{win}_kurt'] = df.groupby('stationId')[col].rolling(window=win, min_periods=3, closed='left').kurt().values
        df[f'{col}_win{win}_median'] = df.groupby('stationId')[col].rolling(window=win, min_periods=3, closed='left').median().values
        
        # Reverse time within each station; stations must stay ascending so the
        # groupby output order still matches the row order for .values assignment
        df = df.sort_values(['stationId','time'], ascending=[True, False])
        
        df[f'{col}_future_win{win}_mean'] = df.groupby('stationId')[col].rolling(window=win, min_periods=3, closed='left').mean().values
        df[f'{col}_future_win{win}_max'] = df.groupby('stationId')[col].rolling(window=win, min_periods=3, closed='left').max().values
        df[f'{col}_future_win{win}_min'] = df.groupby('stationId')[col].rolling(window=win, min_periods=3, closed='left').min().values
        df[f'{col}_future_win{win}_std'] = df.groupby('stationId')[col].rolling(window=win, min_periods=3, closed='left').std().values
        df[f'{col}_future_win{win}_skew'] = df.groupby('stationId')[col].rolling(window=win, min_periods=3, closed='left').skew().values
        df[f'{col}_future_win{win}_kurt'] = df.groupby('stationId')[col].rolling(window=win, min_periods=3, closed='left').kurt().values
        df[f'{col}_future_win{win}_median'] = df.groupby('stationId')[col].rolling(window=win, min_periods=3, closed='left').median().values
        
        df = df.sort_values(['stationId','time'], ascending=True)
        
        # Second-order features: distance of the current value from its window stats
        df[f'{col}_win{win}_mean_loc_diff'] = df[col] - df[f'{col}_win{win}_mean']
        df[f'{col}_win{win}_max_loc_diff'] = df[col] - df[f'{col}_win{win}_max']
        df[f'{col}_win{win}_min_loc_diff'] = df[col] - df[f'{col}_win{win}_min']
        df[f'{col}_win{win}_median_loc_diff'] = df[col] - df[f'{col}_win{win}_median']
        
        df[f'{col}_future_win{win}_mean_loc_diff'] = df[col] - df[f'{col}_future_win{win}_mean']
        df[f'{col}_future_win{win}_max_loc_diff'] = df[col] - df[f'{col}_future_win{win}_max']
        df[f'{col}_future_win{win}_min_loc_diff'] = df[col] - df[f'{col}_future_win{win}_min']
        df[f'{col}_future_win{win}_median_loc_diff'] = df[col] - df[f'{col}_future_win{win}_median']
        
        
for col in ['is_precipitation']:
    for win in [4,8,12,20,48,96]:
        df[f'{col}_win{win}_mean'] = df.groupby('stationId')[col].rolling(window=win, min_periods=3, closed='left').mean().values
        df[f'{col}_win{win}_sum'] = df.groupby('stationId')[col].rolling(window=win, min_periods=3, closed='left').sum().values


train_df = df[df.is_test==0].reset_index(drop=True)
test_df = df[df.is_test==1].reset_index(drop=True)
del df
gc.collect()

train_df = train_df[train_df['power']!='<NULL>'].reset_index(drop=True)
train_df['power'] = train_df['power'].astype(float)
cols = [f for f in test_df.columns if f not in ['time','power','is_test']] # capacity
def cv_model(clf, train_x, train_y, test_x, capacity, seed=2024):
    folds = 5
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    oof = np.zeros(train_x.shape[0])
    test_predict = np.zeros(test_x.shape[0])
    cv_scores = []
    
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], train_y[valid_index]
        
        # Normalize the target by each station's installed capacity
        trn_y = trn_y / capacity[train_index]
        val_y = val_y / capacity[valid_index]
        
        train_matrix = clf.Dataset(trn_x, label=trn_y)
        valid_matrix = clf.Dataset(val_x, label=val_y)
        params = {
            'boosting_type': 'gbdt',
            'objective': 'regression',
            'metric': 'rmse',
            'min_child_weight': 5,
            'num_leaves': 2 ** 8,
            'lambda_l2': 10,
            'feature_fraction': 0.8,
            'bagging_fraction': 0.8,
            'bagging_freq': 4,
            'learning_rate': 0.1,
            'seed': 2023,
            'nthread' : 16,
            'verbose' : -1,
        }
        # LightGBM >= 4 moved early stopping and logging into callbacks
        model = clf.train(params, train_matrix, 3000, valid_sets=[train_matrix, valid_matrix],
                          categorical_feature=[],
                          callbacks=[clf.early_stopping(200), clf.log_evaluation(500)])
        val_pred = model.predict(val_x, num_iteration=model.best_iteration)
        test_pred = model.predict(test_x, num_iteration=model.best_iteration)
        
        oof[valid_index] = val_pred
        test_predict += test_pred / kf.n_splits
        
        score = 1/(1+np.sqrt(mean_squared_error(val_pred * capacity[valid_index], val_y * capacity[valid_index])))
        cv_scores.append(score)
        print(cv_scores)
        
        if i == 0:
            imp_df = pd.DataFrame()
            imp_df["feature"] = cols
            imp_df["importance_gain"] = model.feature_importance(importance_type='gain')
            imp_df["importance_split"] = model.feature_importance(importance_type='split')
            imp_df["mul"] = imp_df["importance_gain"]*imp_df["importance_split"]
            imp_df = imp_df.sort_values(by='mul',ascending=False)
            imp_df.to_csv('feature_importance.csv', index=False)
            print(imp_df[:30])
            
    return oof, test_predict

lgb_oof, lgb_test = cv_model(lgb, train_df[cols], train_df['power'], test_df[cols], train_df['capacity'])
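Because the model was trained on capacity-normalized targets, `lgb_test` comes back on the power/capacity scale and still has to be rescaled to MW before submission. That post-processing step is not shown above, so here is a hedged sketch using toy stand-in arrays; the real code would presumably use `test_df['capacity'].values` in place of `capacity`:

```python
import numpy as np

# Toy stand-ins for the fold-averaged normalized predictions and per-row capacity
pred_norm = np.array([-0.02, 0.5, 1.10])   # model output on the power/capacity scale
capacity = np.array([100.0, 100.0, 50.0])  # installed capacity in MW

# Scale back to MW and clip to the physically valid range [0, capacity]
pred_mw = np.clip(pred_norm * capacity, 0, capacity)
```

Clipping matters here: a regressor can emit slightly negative values or values above rated capacity, neither of which is physically possible for a wind farm.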
