[1] YUAN Jianyu, YAN Chunyan, YE Zhiwei, et al. Comprehensive Comparison of Methods for Imputing Discrete Missing Data[J]. Journal of Hubei University of Technology, 2021, (5): 59-63.

 Comprehensive Comparison of Methods for Imputing Discrete Missing Data

Journal of Hubei University of Technology [ISSN: 1003-4684 / CN: 42-1752/Z]

Volume:
Issue:
No. 5, 2021
Pages:
59-63
Section:
Journal of Hubei University of Technology
Publication date:
2021-10-31

Article Info

Title:
 Comprehensive Comparison of Methods for Imputing Discrete Missing Data
Article ID:
1003-4684(2021)05-0059-05
Author(s):
 YUAN Jianyu1, YAN Chunyan1, YE Zhiwei1, YANG Zhiyong2
 1 School of Computer Science, Hubei University of Technology, Wuhan 430068, China;
2 Science and Technology Information Division, Hubei Provincial Public Security Department, Wuhan 430064, China
Keywords:
 missing data; data imputation; discrete data; method comparison
CLC number:
TP391.9
Document code:
A
Abstract:
 Missing data is a widespread problem in data mining: it degrades data quality and the robustness of analyses, and can ultimately lead to biased decisions. Commonly used imputation methods mainly target continuous data, and most are not suitable for discrete data, yet comprehensive research on discrete data imputation is still lacking both in China and abroad. To address this, we adapted existing models to discrete data and systematically compared the imputation performance of five methods: mode imputation, random imputation, K-nearest-neighbor (KNN) imputation, autoencoder-based imputation, and generative adversarial network (GAN) based imputation. The results show that imputation quality differs considerably among the methods, which in turn affects the accuracy of subsequent analyses. It is therefore important to choose an imputation scheme suited to the data set during the data preprocessing stage.
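To make the three simplest methods named in the abstract concrete, the following is a minimal sketch of mode, random, and KNN imputation for categorical data. It is not the authors' implementation, which this page does not show. The `MISSING` marker, the function names, the toy data, and the use of a normalized Hamming distance for KNN are all assumptions made for illustration. The sketch also assumes every column has at least one observed value. The autoencoder and GAN variants are omitted.

```python
import random
from collections import Counter

MISSING = None  # hypothetical marker for a missing cell


def mode_impute(rows):
    """Fill each missing cell with the most frequent observed value of its column."""
    n_cols = len(rows[0])
    modes = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not MISSING]
        modes.append(Counter(observed).most_common(1)[0][0])
    return [[modes[j] if v is MISSING else v for j, v in enumerate(r)] for r in rows]


def random_impute(rows, seed=0):
    """Fill each missing cell with a random draw from the observed values of its column."""
    rng = random.Random(seed)
    n_cols = len(rows[0])
    pools = [[r[j] for r in rows if r[j] is not MISSING] for j in range(n_cols)]
    return [[rng.choice(pools[j]) if v is MISSING else v for j, v in enumerate(r)]
            for r in rows]


def hamming(a, b):
    """Fraction of mismatching values over the columns observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not MISSING and y is not MISSING]
    if not shared:
        return 1.0  # nothing comparable: treat as maximally distant
    return sum(x != y for x, y in shared) / len(shared)


def knn_impute(rows, k=3):
    """Fill each missing cell by majority vote among the k nearest rows observing that column."""
    filled = [list(r) for r in rows]
    for i, r in enumerate(rows):
        for j, v in enumerate(r):
            if v is not MISSING:
                continue
            donors = [o for m, o in enumerate(rows) if m != i and o[j] is not MISSING]
            donors.sort(key=lambda o: hamming(r, o))
            votes = Counter(o[j] for o in donors[:k])
            filled[i][j] = votes.most_common(1)[0][0]
    return filled


# Toy categorical data set (invented for this example).
rows = [
    ["red", "S", "yes"],
    ["red", "S", MISSING],
    ["blue", "L", "no"],
    ["blue", MISSING, "no"],
    ["red", "S", "yes"],
    ["red", "M", "yes"],
]

mode_filled = mode_impute(rows)
random_filled = random_impute(rows, seed=0)
knn_filled = knn_impute(rows, k=1)
```

On this toy data the methods already disagree: for the missing size in the fourth row, mode imputation picks the globally most common value `"S"`, while KNN with `k=1` copies `"L"` from the only other `"blue"`/`"no"` row. This is the kind of divergence the abstract reports across methods.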



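Memo:
 [Received] 2020-06-06
[Funding] Open Fund of the Fujian Provincial Key Laboratory of Big Data Management New Technology and Knowledge Engineering (BD201801)
[First author] YUAN Jianyu (b. 1990), male, from Ninghai, Zhejiang; master's student at Hubei University of Technology; research interest: data mining
[Corresponding author] YE Zhiwei (b. 1978), male, from Xishui, Hubei; Ph.D. in engineering; professor at Hubei University of Technology; research interests: machine learning and data mining
Last Update: 2021-11-01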