Before we dive into data science, let us first understand what data is.
Data is everywhere! In today’s digital world, data is generated by many sources, such as mobile devices, ATMs, airline operations and many more. Data is recorded information that can be analysed to gain insights. It is very valuable, and that’s why companies are investing heavily in it.
Data can be categorized into three types:
- Structured data: The data follows a fixed schema, such as the rows and columns of a table, so each value can be clearly distinguished from the others.
- Unstructured data: The data has no fixed structure. Anything stored without a defined format, such as free text, images or audio, is considered unstructured data.
- Semi-structured data: The data has some structure but does not follow a strict data model. JSON and XML documents are common examples.
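To make the three categories concrete, here is a minimal Python sketch that uses only the standard library. The sample records, field names and values are invented purely for illustration, and `field_names` is a hypothetical helper, not part of any library.

```python
import csv
import io
import json

# Structured: every row follows the same fixed schema (name, age, city).
structured_csv = "name,age,city\nAsha,31,Pune\nRavi,45,Delhi\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Semi-structured: each record describes itself, but fields may be
# nested or missing, so there is no strict schema.
semi_structured_json = """
[
  {"name": "Asha", "contact": {"email": "asha@example.com"}},
  {"name": "Ravi", "tags": ["premium", "mobile"]}
]
"""
records = json.loads(semi_structured_json)

# Unstructured: free text with no predefined fields at all.
unstructured_text = "Asha called about a delayed flight and asked for a refund."


def field_names(record):
    """Return the sorted top-level field names of a dict-like record."""
    return sorted(record.keys())


# Every structured row exposes exactly the same columns.
print([field_names(r) for r in rows])
# Semi-structured records can differ from one another.
print([field_names(r) for r in records])
# Unstructured text must be interpreted before it yields any fields.
print(len(unstructured_text.split()), "words of free text")
```

The structured rows all report the same columns, while the two JSON records report different fields. That difference is what separates structured from semi-structured data in practice.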
Data science is an interdisciplinary field in which data is extracted and analysed, and various algorithms are applied to it in order to gain useful insights.
It is a combination of statistics, data analysis, machine learning and much more. Almost anything you do with data involves one of the data science processes.
Data science is a discipline made up of a set of processes. These include the following:
- Data Collection: Data is collected from various sources. It may arrive in batches or be collected continuously.
- Exploratory Data Analysis: The collected data set is analysed to understand its distribution, check for errors and outliers, and perform statistical analysis.
- Data Cleansing: The analysed data must be cleaned to fix any errors. This is usually the most time-consuming step in the life cycle. By common estimates, around 60% of the time is spent cleaning the data: fixing errors, handling missing values (where possible), dealing with outliers and getting the data into the right format.
- Data Modelling: Once a cleaned, formatted and standardized data set is available, the next step is to fit a model suited to the problem, such as a regression or classification model. Choosing a model appropriate to the data set is key here.
- Model Validation: Once a model has been chosen, we validate it to confirm that it is statistically acceptable and to measure its performance and accuracy. Validation shows how the model performs on a separate validation data set.
- Visualization: To present the analysed and predicted results, it is preferable to use visualization tools, which display statistical data in the form of charts and graphs. Widely used tools include Tableau, Power BI, R Shiny, etc.
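The steps above can be sketched end to end in a few lines of Python. This is only an illustration under stated assumptions: the data set (hours studied versus exam score) is invented, the interquartile-range rule is just one of several ways to flag outliers, the model is a plain least-squares line, and the hold-out split is deliberately simplistic. `fit_line` and `mean_absolute_error` are hypothetical helpers written for this example.

```python
import statistics

# 1. Data collection: hypothetical (hours_studied, exam_score) pairs.
#    None marks a missing value; 400.0 is a deliberate entry error.
raw = [
    (1.0, 52.0), (2.0, 55.0), (3.0, 61.0), (4.0, None),
    (5.0, 400.0), (6.0, 70.0), (7.0, 74.0), (8.0, 79.0),
    (9.0, 83.0), (10.0, 88.0),
]

# 2. Exploratory data analysis: look at the score distribution and
#    derive outlier bounds with the interquartile-range (IQR) rule.
scores = [y for _, y in raw if y is not None]
q1, median, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# 3. Data cleansing: drop rows with missing or outlying scores.
clean = [(x, y) for x, y in raw if y is not None and low <= y <= high]


# 4. Data modelling: fit y = slope * x + intercept by least squares.
def fit_line(points):
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(points, slope, intercept):
    return statistics.fmean(
        abs(y - (slope * x + intercept)) for x, y in points
    )


# 5. Model validation: hold out the last two rows and measure the error
#    on data the model has not seen. Real projects would normally
#    shuffle the data or use cross-validation instead.
train, valid = clean[:-2], clean[-2:]
slope, intercept = fit_line(train)
mae = mean_absolute_error(valid, slope, intercept)

# 6. Visualization: a plain-text bar chart stands in for a charting tool.
print(f"median score={median}, outlier bounds=({low}, {high})")
print(f"slope={slope:.2f}, intercept={intercept:.2f}, validation MAE={mae:.2f}")
for x, y in valid:
    predicted = slope * x + intercept
    print(f"x={x:>4}  actual    {'#' * round(y / 5)}")
    print(f"        predicted {'#' * round(predicted / 5)}")
```

Even in this toy example, the cleaning step decides what the model sees: leaving the 400.0 entry in the training data would distort the fitted line far more than any choice of model.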
Overall, the data science process involves many different tasks, so separate dedicated teams are often formed to perform them. So don’t assume that you will become a data scientist in no time; you become a data scientist with time and experience.