A method to solve the problem of missing data, outlier data, and noisy data to improve the performance of human and information interaction

Mazoochi, Mojtaba; Rabiei, Leila; Moradi, Mohammad

Volume 9, Issue 4 (1-2023) Human Information Interaction 2023, 9(4): 13-25

20.1001.1.24237418.1401.9.4.1.3

Mazoochi M, Rabiei L, Moradi M. A method to solve the problem of missing data, outlier data, and noisy data to improve the performance of human and information interaction. Human Information Interaction 2023; 9 (4)
URL: http://hii.khu.ac.ir/article-1-3077-en.html

Mojtaba Mazoochi, Leila Rabiei, Mohammad Moradi

ICT Research Institute, Tehran, Iran.

Abstract:

Introduction: Errors in data collection, and noisy data that go unnoticed during collection, undermine data-driven analysis and lead to wrong decisions. Handling missing or noisy data before processing and analysis is therefore vital in analytical systems. The purpose of this paper is to provide a method that identifies noisy data, outliers, and missing data and offers a suitable way to handle them.
Methods: This study is applied research. Data mining techniques, including binning smoothing and regression models, were used to identify and replace outlier and noisy data.
Results: Tests performed in a real environment on social network data show that the proposed method performs well and is more accurate than binning smoothing, averaging, and linear regression. For the tweet data, the mean squared error was 0.04 for the proposed method, 0.38 for binning smoothing, 0.05 for linear regression, and 0.06 for averaging.
Conclusion: The method presented in this article first identifies outlier data through one-third and two-thirds normal and then replaces the outliers with values from a linear regression model, which improves the use and processing of information and, in turn, human-information interaction.
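
As a rough illustration of the detect-then-replace idea described in the Conclusion, the sketch below flags suspicious values and fills them from a straight line fitted to the remaining points. The abstract does not spell out the "one-third and two-thirds normal" rule, so this sketch swaps in a median-absolute-deviation test as a stand-in. The function names `fit_line` and `replace_outliers`, the threshold of 3.5, and the sample series are all assumptions made for illustration; this is not the authors' implementation.

```python
from statistics import median


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def replace_outliers(ys, threshold=3.5):
    """Fill missing (None) and outlying values from a line fitted to the rest.

    Assumption: outliers are flagged with a modified z-score based on the
    median absolute deviation (MAD). This is a stand-in rule, NOT the
    "one-third and two-thirds normal" rule named in the abstract.
    """
    xs = list(range(len(ys)))  # use the position in the series as the predictor
    present = [y for y in ys if y is not None]
    med = median(present)
    mad = median(abs(y - med) for y in present)

    def is_bad(y):
        if y is None:
            return True  # missing values are always replaced
        if mad == 0:
            return False  # no spread, so nothing can be flagged
        return 0.6745 * abs(y - med) / mad > threshold

    good = [(x, y) for x, y in zip(xs, ys) if not is_bad(y)]
    slope, intercept = fit_line([x for x, _ in good], [y for _, y in good])
    return [slope * x + intercept if is_bad(y) else y for x, y in zip(xs, ys)]


# Hypothetical series: 50.0 is an outlier and None is a missing value.
series = [1.0, 2.0, 3.0, 50.0, 5.0, None, 7.0]
cleaned = replace_outliers(series)
print([round(v, 6) for v in cleaned])  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
```

The sketch needs at least two unflagged points with distinct positions to fit the line. Measuring deviations from a global median is also crude for strongly trending data, where flagging on residuals from a fitted trend would likely be a better choice.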

Keywords: Noisy Data, Outliers, Missing Data, Smoothing, Binning Method, Regression Model
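
The binning method named in the keywords, and used as a baseline in the Results, can be sketched as smoothing by bin means: split the values into equal-size bins and replace each value with the mean of its bin. The example below also shows how a mean squared error is computed. The helper names `smooth_by_bin_means` and `mean_squared_error`, the bin size of 3, and the sample values are assumptions for illustration, and the resulting number has no relation to the errors reported in the abstract.

```python
def smooth_by_bin_means(values, bin_size):
    """Replace each value with the mean of its bin (consecutive equal-size bins)."""
    smoothed = []
    for start in range(0, len(values), bin_size):
        bin_values = values[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


def mean_squared_error(actual, predicted):
    """Average of the squared differences between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical sorted values, split into three bins of three.
values = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(values, bin_size=3)
print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]

# Error between the original values and their smoothed versions.
mse = mean_squared_error(values, smoothed)
print(mse)
```

In an evaluation like the one in the Results, the error would presumably be measured between values that were deliberately hidden and the values a method fills in for them, rather than between raw and smoothed data as shown here.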

Type of Study: Research | Subject: Special

Rights and permissions
	This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
