A unique challenge I faced when cleaning data was dealing with inconsistent data formats from multiple sources. We were integrating data from various platforms, each using different formats for dates, currency, and even naming conventions. This inconsistency made it difficult to analyze the data accurately, as errors and discrepancies would arise during processing. To overcome this, I first standardized the data by creating a set of rules that defined the correct format for each data type. I then used data cleaning tools like Python’s Pandas library to apply these rules across the entire dataset, converting all entries into a uniform format. For example, I converted all date formats to a single standard (YYYY-MM-DD) and ensured currency values were in the same denomination. I also implemented automated scripts to identify and flag any outliers or anomalies that didn't conform to the established rules. This process allowed us to catch errors early and make corrections before the data was used for analysis. By standardizing the data formats and automating the cleaning process, we were able to achieve a high level of data accuracy, which significantly improved the quality of our insights and decision-making.
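To make this concrete, here is a minimal pandas sketch of the kind of standardization described above. The column names, the `RATES_TO_USD` table, and the specific rules are illustrative assumptions, not the original production rule set.

```python
import pandas as pd

# Hypothetical exchange rates; in practice the rule set defined where these
# conversion factors came from rather than hard-coding them.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.08, "GBP": 1.27}

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # Parse each source's date format, then render everything as YYYY-MM-DD.
    # format="mixed" requires pandas >= 2.0; unparseable dates become NaT.
    out["order_date"] = pd.to_datetime(
        out["order_date"], format="mixed", errors="coerce"
    ).dt.strftime("%Y-%m-%d")

    # Convert all currency values to a single denomination (USD here).
    out["amount_usd"] = out["amount"] * out["currency"].map(RATES_TO_USD)

    # Flag anything that failed a rule so it gets reviewed before analysis.
    out["needs_review"] = out["order_date"].isna() | out["amount_usd"].isna()
    return out

df = pd.DataFrame({
    "order_date": ["03/15/2023", "2023-03-16", "16 Mar 2023"],
    "amount": [100.0, 90.0, 80.0],
    "currency": ["USD", "EUR", "JPY"],  # JPY not in the table -> flagged
})
print(standardize(df))
```

The key design choice is that rows violating a rule are flagged rather than silently dropped, which matches the goal of catching errors early and correcting them before analysis.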
Every Machine Learning (ML) or AI algorithm builds its model from data fed into special-purpose calculations; that dataset is called the training dataset. If bad data is included in the training set, the model will be "misaligned," so removing those bad data points is critical to the model's effectiveness. Generally, a data cleansing operation is performed every time a model is trained. A specific example I've encountered deals with data output from sensors and measured by analog-to-digital (A/D) converters. Since these sensors convert a physical parameter to an electrical signal, the measurements can be corrupted by two main sources: electrical noise and calibration drift, and the process of cleaning the measurements is different for each.

For noise, outlier detection is the preferred method for locating and removing glitches and spikes. Outlier detection is made challenging when the parameter being measured changes over time; for example, the pressure in a manifold feeding fuel to a turbine will rise or fall with the required power output. A rolling median or other block filtering method is usually sufficient (the first sketch below illustrates one), but there are many more sophisticated outlier detection methods, all of which encode some expectation of what the data should look like. A tried-and-true method is Kalman filtering; many others have been developed that are best suited to specific situations.

Calibration drift is trickier to detect because it is usually slow relative to the physics-driven changes in the parameter being measured. However, repetitions or known behavior in that parameter can be used to expose drift. For example, if the measurements of a sequence of parts should all be the same, because the CNC machine that cut or drilled them has its own built-in validation, those measurements will reveal drift over time. Or there may be points in each cycle that are expected to read "zero," so any offset there is drift (the second sketch below takes this approach). If drift is detected, the next step is to decide whether it is significant; if so, you can remove all subsequent data until the sensor is recalibrated. This means less training data for your model, but at least it's clean.
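First, a minimal sketch of the rolling-median filtering described above. The window size and the MAD-based threshold are illustrative assumptions; real values depend on the sensor and sampling rate.

```python
import numpy as np
import pandas as pd

def remove_spikes(signal: pd.Series, window: int = 11, n_mads: float = 5.0) -> pd.Series:
    """Drop samples that sit far from a rolling median.

    The rolling median tracks slow, physics-driven changes (e.g. manifold
    pressure following power demand), so single-sample glitches stand out
    as large residuals. `window` and `n_mads` are illustrative defaults.
    """
    med = signal.rolling(window, center=True, min_periods=1).median()
    resid = signal - med
    # Median absolute deviation: a spread estimate that spikes can't inflate.
    mad = resid.abs().rolling(window, center=True, min_periods=1).median()
    # Keep points within n_mads scaled deviations of the local median
    # (1.4826 * MAD approximates one standard deviation for Gaussian noise).
    keep = resid.abs() <= n_mads * 1.4826 * mad + 1e-9
    return signal[keep]

# Slowly varying pressure signal with two injected glitches.
t = np.linspace(0, 10, 500)
pressure = pd.Series(100 + 20 * np.sin(t))
pressure.iloc[[120, 350]] += 40  # noise spikes
clean = remove_spikes(pressure)
print(f"removed {len(pressure) - len(clean)} samples")
```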
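Second, a sketch of the known-zero drift check: samples that should read zero (marked by a hypothetical `zero_mask`) expose the sensor's offset directly, and everything after the first significant offset is discarded until recalibration. The threshold and the simulated drift rate are assumptions for illustration.

```python
import numpy as np

def first_significant_drift(readings: np.ndarray,
                            zero_mask: np.ndarray,
                            threshold: float) -> int | None:
    """Return the index where drift first exceeds `threshold`, else None.

    `zero_mask` marks samples whose true value is known to be zero
    (e.g. between parts on a CNC machine), so any nonzero reading there
    is sensor offset. `threshold` is application-specific.
    """
    for i in np.flatnonzero(zero_mask):
        if abs(readings[i]) > threshold:
            return i
    return None

# Simulated run: slow linear drift on top of known-zero points each cycle.
n = 1000
readings = 0.002 * np.arange(n)        # pure drift for simplicity
zero_mask = np.zeros(n, dtype=bool)
zero_mask[::50] = True                 # every 50th sample should read zero

cut = first_significant_drift(readings, zero_mask, threshold=1.0)
if cut is not None:
    # Discard everything after drift became significant: less training
    # data for the model, but what remains is clean.
    readings = readings[:cut]
print(f"kept {len(readings)} of {n} samples")
```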
One interesting problem I encountered during data scrubbing was the presence of different formats for the same piece of information. As customer information was collected and entered from different sources, some records had names entirely in upper case, dates in varying formats, or telephone numbers with different separator characters. To tackle this, we developed a data normalization process using scripting and automation tools like Python: we wrote scripts to standardize the case style of text fields, the date format, and the way phone numbers were captured (a sketch follows below). In addition, we validated each record before it was keyed into the central database, applying validation rules to correct data-entry mistakes. This not only cleaned the data that had already been captured but also prevented inconsistently formatted records from entering the system going forward. Raising the data's quality in this way strengthened our analysis and decision-making, and the automation minimized manual rework and human error across every system that consumed the data.
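A minimal sketch of normalization scripts along these lines. The field names, the list of accepted date formats, and the 10-digit phone assumption are illustrative, not the original rules.

```python
import re
from datetime import datetime

def normalize_name(name: str) -> str:
    # Collapse repeated whitespace and replace ALL CAPS with title case.
    return " ".join(name.split()).title()

def normalize_date(raw: str) -> str | None:
    # Try the formats the sources actually used; extend as new ones appear.
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y"):
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # flag for manual review instead of guessing

def normalize_phone(raw: str) -> str | None:
    # Strip every separator, then validate the digit count
    # (10-digit North American numbers assumed for this sketch).
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits if len(digits) == 10 else None

record = {"name": "JANE  DOE", "signup": "03/15/2023", "phone": "(555) 123-4567"}
clean = {
    "name": normalize_name(record["name"]),
    "signup": normalize_date(record["signup"]),
    "phone": normalize_phone(record["phone"]),
}
print(clean)  # {'name': 'Jane Doe', 'signup': '2023-03-15', 'phone': '5551234567'}
```

Returning None for anything that fails validation mirrors the approach described: questionable records are corrected or reviewed before entering the central database rather than being stored as-is.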