iCleaner: A Data Cleansing Tool for Outlier Detection in a Data Warehousing Environment
Kofi Sarpong Adu-Manu *
Department of Computer Science/Information Technology, Valley View University, Accra, Ghana.
John Kingsley Arthur
Department of Computer Science/Information Technology, Valley View University, Accra, Ghana.
Joseph Kobina Panford
Department of Computer Science, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana.
Joseph George Davis
Department of Computer Science, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana.
*Author to whom correspondence should be addressed.
Abstract
The implementation of Data Cleansing (DC) in Data Warehousing (DW) has become essential in recent years. Organizations around the world generate huge amounts of data from their day-to-day activities and depend on it for their operations. These organizations will not survive if the data they generate remains dirty or erroneous. Data becomes dirty through errors or outliers such as data entry errors, outdated data in the database, data migrated from old databases, and changes made at the source repository. Customers' records also need updating as their details change (for example, attributes such as marital status, phone number, or address change over time); records that are not updated become obsolete and reduce the quality of the data. To maintain high-quality data over time, the data requires cleansing. The proposed Integrated Cleaning (iCleaner) tool is developed to facilitate the data cleaning process and to address the problems associated with duplicated records. In addition, the proposed cleaning tool is able to detect and update missing data by merging key columns within the records. The system is flexible to use and comes with a convenient, user-friendly interface designed for the data cleansing process. We provide efficient but simple algorithms designed to perform these functions and report the running time of the system.
Keywords: Data cleansing, data quality, database system management, data warehousing, automatic periodic scheduler, duplicates, missing data, data integration, data quality tools.