There was one project where we had to read in CSV data from files that were automatically generated by a third-party vendor application, and the files had a variety of formatting issues that broke our standard ETL process. The solution we developed was to read and validate the files at a granular, row-by-row, field-by-field level. We defined precise rules for row length and expected field values, identified cases where the data could be corrected automatically, and flagged cases where there was no clear fix and the row had to be skipped. This process allowed us to salvage more than 95% of the vendor data.
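To make the row-by-row validation concrete, here is a minimal sketch in Python. The file name, the expected column count, and the specific correction and skip rules are illustrative assumptions, not details from the original project.

```python
import csv

EXPECTED_COLUMNS = 5  # hypothetical: the vendor files had a fixed column count


def try_fix_row(row):
    """Apply simple automatic corrections; return the fixed row or None if unrecoverable."""
    # Example correction: strip stray whitespace and surrounding quotes from each field.
    row = [field.strip().strip('"') for field in row]
    # Example correction: drop trailing empty fields caused by a dangling delimiter.
    while len(row) > EXPECTED_COLUMNS and row[-1] == "":
        row = row[:-1]
    if len(row) != EXPECTED_COLUMNS:
        return None  # wrong field count with no obvious fix -> skip the row
    if not row[0].isdigit():
        return None  # hypothetical rule: first field must be a numeric ID -> skip
    return row


def load_vendor_csv(path):
    """Read a vendor CSV, keeping corrected rows and counting skipped ones."""
    good_rows, skipped = [], 0
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        for row in csv.reader(f):
            fixed = try_fix_row(row)
            if fixed is None:
                skipped += 1
            else:
                good_rows.append(fixed)
    return good_rows, skipped


rows, skipped = load_vendor_csv("vendor_export.csv")  # hypothetical file name
print(f"kept {len(rows)} rows, skipped {skipped}")
```

In practice the per-field rules would be driven by whatever the downstream ETL schema expects, but the structure stays the same: attempt a correction, keep the row if it now passes validation, otherwise skip it and count it.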
One unique challenge I grappled with was a dataset suffering from the 'curse of dimensionality'. We had collected information across many dimensions, and while more data usually promises better insights, this time it was like navigating a dense forest; the extra features cluttered our analysis. It was a mathematician's labyrinth! The solution was to apply dimensionality reduction techniques such as principal component analysis (PCA), which cut down the redundancy and noise. We managed to prune that forest into a manageable, insightful grove!
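Here is a minimal sketch of a PCA step like this using scikit-learn. The synthetic data, the standardization step, and the 95% explained-variance threshold are illustrative assumptions rather than details from the original project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))  # stand-in for a wide, high-dimensional dataset

# Standardize first so no single feature dominates the principal components.
X_scaled = StandardScaler().fit_transform(X)

# Keep just enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(f"reduced from {X.shape[1]} to {X_reduced.shape[1]} dimensions")
print("explained variance ratios:", np.round(pca.explained_variance_ratio_, 3))
```

The variance threshold is the main tuning knob: a higher value keeps more components and more detail, a lower one prunes the feature space more aggressively.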