In the data science world, engineering the data is the key factor. If you try to fit an incomplete data set into a machine learning algorithm, it produces errors and improper predictions. Assuming data science is 100% of the effort on any problem, about 70% of the time goes into data engineering, 20% goes into identifying the model and improving its accuracy, and the remaining 10% goes into restructuring the data.
This article covers the main effort, data engineering, which makes up 70% of the effort in data science. But what exactly are data science, data engineering, artificial intelligence and machine learning?
Data Science: The discovery of insights in data using sophisticated software and tools.
Data Engineering: The process of cleansing and analysing the data. It is where you try to replace NULL values with a wise guess such as the mean, median, mode or an interpolated value.
Artificial Intelligence: The intelligence demonstrated by machines, in contrast to natural intelligence.
Machine Learning: A subset of artificial intelligence that gives systems the ability to learn and improve automatically from experience without being explicitly programmed.
Let's take an example of a data set that has incomplete or improper data, clean it, and replace the missing data with a wise and intelligent guess. Here we use Python pandas as the software tool to cleanse the data set.
What is Pandas?
Python pandas is similar to the SQL language. It treats a data set as rows and columns and recognises the data type of each field. We can query the data and insert, update and merge it. It can also perform joins.
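To make the SQL comparison concrete, here is a minimal sketch. The two small tables and their column names (Ticker, Close, Company) are invented for illustration and are not part of the stock data set used in the rest of this article.

```python
import pandas as pd

# Hypothetical tables, invented only to illustrate the SQL-like operations.
prices = pd.DataFrame({"Ticker": ["AAA", "BBB", "CCC"], "Close": [10.0, 20.0, 30.0]})
names = pd.DataFrame({"Ticker": ["AAA", "BBB"], "Company": ["Alpha", "Beta"]})

# Query: roughly SELECT * FROM prices WHERE Close > 15
expensive = prices[prices["Close"] > 15]

# Update: roughly UPDATE prices SET Close = 25 WHERE Ticker = 'BBB'
prices.loc[prices["Ticker"] == "BBB", "Close"] = 25.0

# Join: roughly prices LEFT JOIN names ON Ticker
joined = prices.merge(names, on="Ticker", how="left")

print(prices.dtypes)  # pandas infers a data type for every column
print(expensive)
print(joined)
```

The left join keeps every row of prices, so a ticker without a matching company ends up with a missing value in the Company column.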
Let's process the stock data set.
We start by importing the pandas library using the import statement.
We then read the data set (stock_data.csv) into a DataFrame (data).
Let's display the contents of the DataFrame.
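These first steps can be sketched as follows. The contents of stock_data.csv are not reproduced here, so the sketch reads a small made-up sample from a string instead. The column names (Date, Open, Close, Summary) and all values are assumptions chosen to match the description in this article.

```python
import io

import pandas as pd

# Hypothetical stand-in for stock_data.csv; names and values are invented.
SAMPLE_CSV = """Date,Open,Close,Summary
1/1/2017,100.0,102.0,Up
1/4/2017,,104.0,
1/5/2017,105.0,,Down
1/6/2017,107.0,108.0,Up
"""

# With the real file this would be: data = pd.read_csv("stock_data.csv")
data = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Display the contents of the DataFrame; empty cells show up as NaN.
print(data)
```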
Let's find the data type of each field.
The data.info() command displays the non-null count and the data type of each field. Here the object data type is equivalent to a string. The Date field is a string, so it needs to be converted to a datetime data type.
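A minimal sketch of the inspection and the conversion, again on a made-up sample in place of stock_data.csv (column names and values are assumptions). The month/day/year format string is also an assumption, based on the dates written as 1/2/2017 and 1/3/2017 in this article.

```python
import io

import pandas as pd

# Hypothetical stand-in for stock_data.csv; names and values are invented.
SAMPLE_CSV = """Date,Open,Close,Summary
1/1/2017,100.0,102.0,Up
1/4/2017,,104.0,
1/5/2017,105.0,,Down
1/6/2017,107.0,108.0,Up
"""
data = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Show the non-null count and data type of every column.
data.info()

# Convert the Date strings to real datetimes (assumed month/day/year order).
data["Date"] = pd.to_datetime(data["Date"], format="%m/%d/%Y")

print(data.dtypes)
```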
Now the Date field has the datetime64 data type, and all the data types are correct. What about the data itself? Let's check it.
1. The dates January 2 and January 3 (1/2/2017 and 1/3/2017) are missing.
2. There are NULL values in the Open price, Close price and Summary fields.
- NaN: Not a Number
3. The missing Summary values are also displayed as NaN, which needs to be corrected.
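Rather than spotting these problems by eye, they can also be found programmatically. The sketch below is one possible way to do it, using a made-up sample in place of stock_data.csv (column names and values are assumptions).

```python
import io

import pandas as pd

# Hypothetical stand-in for stock_data.csv; names and values are invented.
SAMPLE_CSV = """Date,Open,Close,Summary
1/1/2017,100.0,102.0,Up
1/4/2017,,104.0,
1/5/2017,105.0,,Down
1/6/2017,107.0,108.0,Up
"""
data = pd.read_csv(io.StringIO(SAMPLE_CSV))
data["Date"] = pd.to_datetime(data["Date"], format="%m/%d/%Y")

# Count the NaN values in each column.
missing_per_column = data.isna().sum()

# Compare the dates we have against every calendar day in the covered range.
full_range = pd.date_range(data["Date"].min(), data["Date"].max(), freq="D")
missing_dates = full_range.difference(pd.DatetimeIndex(data["Date"]))

print(missing_per_column)
print(missing_dates)
```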
Let's fill the numerical fields with 0.0 and the Summary field with No Data.
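One way to do this is to pass fillna a dictionary with one replacement value per column. The sample data and column names below are assumptions standing in for stock_data.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for stock_data.csv; names and values are invented.
SAMPLE_CSV = """Date,Open,Close,Summary
1/1/2017,100.0,102.0,Up
1/4/2017,,104.0,
1/5/2017,105.0,,Down
1/6/2017,107.0,108.0,Up
"""
data = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Replace NaN with a fixed value chosen per column.
filled = data.fillna({"Open": 0.0, "Close": 0.0, "Summary": "No Data"})

print(filled)
```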
But filling with a fixed value does not seem a wise guess. Instead, let's replace each NaN with the previous value (a forward fill).
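A forward fill can be sketched with the ffill method, which copies the last valid value in each column down into the gaps. The sample data and column names are assumptions standing in for stock_data.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for stock_data.csv; names and values are invented.
SAMPLE_CSV = """Date,Open,Close,Summary
1/1/2017,100.0,102.0,Up
1/4/2017,,104.0,
1/5/2017,105.0,,Down
1/6/2017,107.0,108.0,Up
"""
data = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Each NaN takes the value from the row above it.
forward = data.ffill()

print(forward)
```

A NaN in the very first row would stay NaN, because there is no earlier value to copy.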
We can also use a backward fill, which takes the next value instead.
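The backward fill mirrors the forward fill and can be sketched with the bfill method. As before, the sample data and column names are assumptions standing in for stock_data.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for stock_data.csv; names and values are invented.
SAMPLE_CSV = """Date,Open,Close,Summary
1/1/2017,100.0,102.0,Up
1/4/2017,,104.0,
1/5/2017,105.0,,Down
1/6/2017,107.0,108.0,Up
"""
data = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Each NaN takes the value from the row below it.
backward = data.bfill()

print(backward)
```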
Now let's add the missing dates, January 2 and January 3, to the date range.
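One possible way to add the missing days is to make Date the index and reindex against a complete daily date range. This is a sketch on made-up sample data (column names and values are assumptions), and reindex is only one of several ways to do it.

```python
import io

import pandas as pd

# Hypothetical stand-in for stock_data.csv; names and values are invented.
SAMPLE_CSV = """Date,Open,Close,Summary
1/1/2017,100.0,102.0,Up
1/4/2017,,104.0,
1/5/2017,105.0,,Down
1/6/2017,107.0,108.0,Up
"""
data = pd.read_csv(io.StringIO(SAMPLE_CSV))
data["Date"] = pd.to_datetime(data["Date"], format="%m/%d/%Y")

# Use the dates as the index so rows can be aligned by day.
data = data.set_index("Date")

# Build every calendar day in the covered range and align the data to it.
all_days = pd.date_range(data.index.min(), data.index.max(), freq="D")
data = data.reindex(all_days)

print(data)
```

The rows added for January 2 and January 3 contain only NaN, which the next steps fill in.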
Let's use the interpolate method to fill in a wise guess.
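The sketch below applies interpolate to the numeric columns only, since a text field such as Summary cannot be interpolated. It assumes the default linear method and uses made-up sample data in place of stock_data.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for stock_data.csv; names and values are invented.
SAMPLE_CSV = """Date,Open,Close,Summary
1/1/2017,100.0,102.0,Up
1/4/2017,,104.0,
1/5/2017,105.0,,Down
1/6/2017,107.0,108.0,Up
"""
data = pd.read_csv(io.StringIO(SAMPLE_CSV))
data["Date"] = pd.to_datetime(data["Date"], format="%m/%d/%Y")
data = data.set_index("Date")
data = data.reindex(pd.date_range(data.index.min(), data.index.max(), freq="D"))

# Linear interpolation: each gap is filled with evenly spaced values
# between the nearest known values before and after it.
numeric_cols = ["Open", "Close"]
data[numeric_cols] = data[numeric_cols].interpolate()

print(data)
```

Because the rows are now evenly spaced one day apart, the default linear method gives the same result here as interpolating by time.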
The Summary field still has null values, so we fill those as well. The data we filled in might not be the exact values, but they should be close to the actual values.
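Putting the steps together, the whole clean-up might look like the sketch below. Filling Summary with the No Data placeholder is an assumption (a forward fill would be another option), and the sample data and column names are invented stand-ins for stock_data.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for stock_data.csv; names and values are invented.
SAMPLE_CSV = """Date,Open,Close,Summary
1/1/2017,100.0,102.0,Up
1/4/2017,,104.0,
1/5/2017,105.0,,Down
1/6/2017,107.0,108.0,Up
"""
data = pd.read_csv(io.StringIO(SAMPLE_CSV))

# 1. Convert the dates and add the missing calendar days.
data["Date"] = pd.to_datetime(data["Date"], format="%m/%d/%Y")
data = data.set_index("Date")
data = data.reindex(pd.date_range(data.index.min(), data.index.max(), freq="D"))

# 2. Interpolate the numeric columns.
numeric_cols = ["Open", "Close"]
data[numeric_cols] = data[numeric_cols].interpolate()

# 3. Fill the remaining text gaps with a placeholder (assumed choice).
data["Summary"] = data["Summary"].fillna("No Data")

print(data)
```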
Now we have cleansed the data set, and it can be used for data modelling with machine learning.