- Yair Mark
Let me start by making it clear I am not a data scientist - I am a complete data science noob. But I am an experienced full stack developer and have used some Python professionally. I also have a CompSci and Maths background, but I have no background at all in statistics.
This article gives some insights into my experiences with SciKit learn using real-world data for a simple model. I found that most of my time at the start of the project involved getting the data into a usable state. This article covers that.
Based on various conversations I have had with data scientists and colleagues familiar with data science, generally data clean up ends up taking the longest. One colleague jokingly said that data scientists should actually be called data janitors because of how much time is spent cleaning data.
This so far has definitely been my experience. I have been working with an extract of real-world data. But much of my initial time was spent just trying to get the data into a workable form as well as understand the data. The CSV I am working with is only 300isMB in size - which is not considered large at all for data - but it still posed some challenges.
Fixing Broken Data
The first challenge was not fixing dirty data but rather fixing completely broken data. When trying to feel out the data with tools like CSV Kit I could not even get CSV Kit to load the data due to various errors like:
csvlook ValueError: Row 6811 has 96 values, but Table only has 94 columns.
xsv found record with 2 fields, but the previous record has 1 fields
This is an error from two different CSV CLI tools but basically point to the same issue. In my case the data extract tool (that created the CSV) for some bizarre reason quoted entire rows. For example, I have a row that is correctly followed by one that is not:
Name, Surname, Age John, Smith, 21 "Sally, Smith, 19"
Debugging Broken Data
To debug these issues I could not use standard GUI text editors as 300MB+ seems to be too big for them. In the end, I opened up the file using plain old
vim in the cli which handled it without a sweat. Using vim I went straight to the line with the issue and saw an example of the above.
Fixing Broken Data
I started by writing a sed query to fix this:
sed -E 's/^"(.*)";;;;;;/\1/g' MY_DATA.csv | less
I intentionally did not use the
-i flag since I did not want this to happen in-place. Instead, I pipe the output to less for further examination. I then re-ran the above but piped it to my cli CSV tool of choice (XSV in the end which I will get into later):
sed -E 's/^"(.*)";;;;;;/\1/g' MY_DATA.csv | xsv table
This lets me see if there are any additional errors and start piping
sed commands together until I get the CSV clean enough to be read by xsv.
I kept applying this chaining sed queries together until I could successfully pipe to