Beginner’s Guide to Data Science Pipeline

Naresh Thakur
5 min read · Apr 2, 2020


Data modeling is often the core of data science. But, data science isn’t limited to modeling alone. Data modeling is just 20% of the complete data science pipeline. In order to extract any ‘value’ from data, it needs to be gathered, scrubbed, and explored, with motivation (to solve a real-world problem) and business domain knowledge serving as guiding forces for a data scientist.

Metaphorically, data science is like wizardry (to predict) and deduction (to compare and interpret). As an aspiring data scientist, you’d want to have the ability to auto-magically predict outcomes and identify previously unknown trends and patterns in your data.

This is where a data science pipeline comes into play.

Understanding ‘how the data science pipeline works’ is the first step towards solving a real-world problem.

In this post, we will discuss the steps involved in a data science pipeline that you need to follow to build a product ready for use by end users.

1. Understanding the Problem

Before you even begin applying data science, you either already have a problem or you need to define a problem statement. You must first define and understand the problem that you're trying to solve. An actionable insight or a product can only be as good as your understanding of the problem.

A thorough understanding of the domain or business is required in dissecting the problem.

The model you intend to build by the end of the data science pipeline will depend completely on the problem at hand. For different requirements and objectives, you’d have to adjust your algorithms. A one-size-fits-all approach does not work.

Example Scenario: Consider, for instance, that you’re building a recommendation engine for an eCommerce portal. The objective is to recommend products to all new visitors on the platform. The business goal is to get a first-time visitor to spend maximum time on the platform and place her first order. A generic system built for both new and returning visitors won’t serve this goal. And if the recommendation engine fails to identify patterns in how new visitors explore different products and place their first order, it’ll provide no value to the business. This is why understanding the problem and the domain is crucial for building a useful data science product.

2. Data Collection

Data is collected based on your understanding of the problem. Data collection is a tedious and time-consuming process. It demands patience, energy, and time.

With more data, it is possible to build more robust models.

It is paramount to work on accurate data in order to build reliable models. If there are too many data-point outliers, even the most refined models are destined to fail.

Example Scenario: You will collect datasets pertaining to first-time visitors, along with key events and actions. For instance, you will track where they click and how they explore various products on the platform. If you include data from returning visitors, you’d only be adding noise to the dataset.
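As a minimal sketch of what this collection step might look like (the connection string, table, and column names here are hypothetical), first-visit click events could be pulled from a relational store with pandas and SQLAlchemy:

```python
# A minimal sketch of collecting first-visit click events.
# The connection string, table, and column names are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/shop")  # placeholder DSN

query = """
    SELECT visitor_id, event_type, product_category, time_on_page_sec, event_time
    FROM click_events
    WHERE is_first_visit = TRUE   -- keep only first-time visitors
"""
events = pd.read_sql(query, engine)  # load the query result into a DataFrame
print(events.shape)
```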

Skills Required:

Querying relational and non-relational databases: MySQL, PostgreSQL, MongoDB

Distributed storage and processing: Hadoop, Apache Spark

Retrieving unstructured data: text, images, videos, audio files, documents, Excel files, etc.

3. Data Cleaning

This phase of the data science pipeline generally requires the most time and effort. The output of a data science model is only as good as the data you put into it. Scripting languages such as Python and R are used for data cleaning.

The collected data is examined, scrubbed, and stored in a structured form. The key objective is to remove as much noise as possible during this phase; domain knowledge and understanding of the business problem help in identifying and removing outliers.

The data thus cleaned will be used for exploratory data analysis and modeling in the next steps.

Example Scenario: All data that adds noise and isn’t tied to the business problem at hand needs to be removed. When you examine the data, you need to identify corrupt records, errors, and missing values. During scrubbing, records with errors or missing values are discarded, replaced, or filled in [e.g. with NA (Not Applicable)].
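As a rough illustration of these scrubbing steps with pandas (the file and column names are hypothetical):

```python
# A minimal cleaning sketch with pandas; file and column names are hypothetical.
import pandas as pd

events = pd.read_csv("first_visit_events.csv")

events = events.drop_duplicates()                                      # remove duplicate records
events = events.dropna(subset=["visitor_id", "event_time"])            # drop rows missing key fields
events["product_category"] = events["product_category"].fillna("NA")   # fill remaining gaps

# Remove obviously corrupt records, e.g. negative time-on-page values
events = events[events["time_on_page_sec"] >= 0]

events.to_parquet("clean_events.parquet")                              # store in a structured form
```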

Skills Required:

Scripting language: Python or R

Data Wrangling Tools: Python Pandas, R

4. Exploratory Data Analysis

Now that you’ve clean data available, it is time to explore it!

During this phase, the goal is to extract insights and identify hidden patterns from the data and map them to the business and the specific problem that needs to be solved.

As in the previous steps, a good understanding of the domain helps steer the analysis in directions where you are more likely to discover useful insights in the data.

Example Scenario: In the example discussed in Step 1, based on your understanding of seasonal trends in the eCommerce market, you may discover that half of the first-time website visitors during the summer period spent more than three minutes checking refrigerators.
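A finding like that can be checked directly with a small aggregation. The sketch below assumes the cleaned dataset from the previous step, with hypothetical column names:

```python
# A sketch of verifying the 'summer refrigerator' observation; column names are hypothetical.
import pandas as pd

events = pd.read_parquet("clean_events.parquet")
events["event_time"] = pd.to_datetime(events["event_time"])

summer = events[events["event_time"].dt.month.isin([6, 7, 8])]       # June-August visits

# Total time each first-time visitor spent on refrigerator pages
fridge_time = (
    summer[summer["product_category"] == "refrigerators"]
    .groupby("visitor_id")["time_on_page_sec"]
    .sum()
)

share = (fridge_time > 180).sum() / summer["visitor_id"].nunique()   # more than 3 minutes
print(f"Share of summer first-time visitors spending 3+ minutes on refrigerators: {share:.0%}")
```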

In practice, you need to develop a sense for spotting unusual or interesting patterns and trends during exploratory data analysis.

Visualization tools help surface patterns through charts and plots; statistical testing methods come in handy for extracting features and backing up findings with graphs and analyses.

Based on the analyses, new features can be created at this stage, if required.
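For example, a simple derived feature such as total time per visitor per product category could be created and visualized like this (a sketch, again with hypothetical column names):

```python
# A sketch of creating a simple feature and plotting it; column names are hypothetical.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

events = pd.read_parquet("clean_events.parquet")

# New feature: total time each visitor spends per product category
time_per_category = (
    events.groupby(["visitor_id", "product_category"])["time_on_page_sec"]
    .sum()
    .reset_index()
)

sns.boxplot(data=time_per_category, x="product_category", y="time_on_page_sec")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```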

Skills Required:

Some popular Python libraries used for exploratory data analysis include Matplotlib and Seaborn for visualization, along with NumPy, Pandas, and SciPy for computation; in R, ggplot2 is the standard choice.

5. Data Modeling

Now, it is time to solve the problem by using Machine Learning and Deep Learning algorithms. This is the most exciting phase of the entire data science pipeline.

Different methods/algorithms are tested, and the one that delivers the best predictive performance is selected. The model is refined and evaluated many times over.

Your model’s predictive power will depend on the quality of the features that you use.

Example Scenario: Your data model for the recommendation engine may predict that at least one item from a combination of certain kitchen appliances, groceries, and grooming products is likely to be purchased by a first-time visitor.

Scikit-learn (Python) and caret (R) libraries can be used for building Machine Learning models. Among the various Deep Learning frameworks available today, Keras/TensorFlow is a common choice; compare frameworks across several aspects before you pick one.
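As a minimal sketch of the “try several algorithms, keep the best” loop in scikit-learn (the feature matrix and target below are synthetic stand-ins generated for illustration, not real visitor data):

```python
# A sketch of comparing a few models with cross-validation.
# X and y are synthetic stand-ins for real features and a binary target
# such as "placed a first order" vs. "did not".
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean ROC AUC = {scores.mean():.3f}")
```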

6. Deployment

Now that the model is ready, it is time to make it accessible to end-users.

The deployed model should be scalable. As new data becomes available, the model can be re-evaluated and updated.
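One common way (though not the only one) to make a model accessible is to wrap it in a small web service. The sketch below assumes a model previously saved with joblib and a hypothetical list of input features:

```python
# A minimal serving sketch with Flask.
# The model file and feature names are hypothetical.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("recommendation_model.joblib")
FEATURES = ["time_on_page_sec", "pages_viewed", "category_count"]

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    row = [[payload[name] for name in FEATURES]]   # order features as the model expects
    prediction = model.predict(row)[0]
    return jsonify({"prediction": int(prediction)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```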

Final Words

It is important that your data science pipeline is solid from start to finish; each step matters.

Follow me on: LinkedIn. Twitter.

If you’ve any questions about the steps outlined above, need help with a related topic, please drop a line in the comment section below. I will respond to your queries as soon as possible.
