About Covid-19

Image from https://cxm.world

Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus.

Most people infected with the coronavirus will experience mild to moderate respiratory illness and recover without requiring special treatment. Those who develop serious illness are mostly older people and people with underlying medical conditions such as diabetes, respiratory diseases, and cancer.

The virus spreads through droplets of saliva or nasal discharge when an infected person coughs or sneezes. The best way to protect yourself from infection is to wash your hands frequently with soap and water for at least 20 seconds, or to use an alcohol-based sanitizer.

As part of teaching myself Spark with Python, I analyzed and visualized the COVID-19 data using Plotly and Spark.


Plotly Express is a new high-level Python visualization library: it’s a wrapper for Plotly.py that exposes a simple syntax for complex charts.


Apache Spark™ is a unified analytics engine for large-scale data processing.

Data Collection

The data was collected using the COVID-19 India API from a third-party website. Below are the details of the APIs used.

Image by author

Note – Please refer to the original website for any change in the APIs.

Imports for the main libraries and the utility functions.

Function to get the COVID-19 API details
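
As a rough illustration of what such a helper might look like, here is a minimal sketch that uses only the standard library. The function name `get_api_data`, the injectable `opener` parameter, and the placeholder URL are assumptions made for this example; refer to the API provider's website for the actual endpoints.

```python
import json
import urllib.request


def get_api_data(url, opener=urllib.request.urlopen, timeout=30):
    """Fetch a URL and return its body parsed as JSON.

    `opener` is a hypothetical hook (not part of any API) that lets the
    function be exercised without network access.
    """
    with opener(url, timeout=timeout) as response:
        # The body arrives as bytes; decode it before parsing.
        return json.loads(response.read().decode("utf-8"))


# Example call (placeholder URL, replace it with the real endpoint):
# payload = get_api_data("https://example.org/covid19/data.json")
```

Keeping the fetch step in one small function means the rest of the analysis only ever deals with plain Python dictionaries and lists.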

Function to transpose columns to rows, keyed by the given columns
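
One common way to turn columns into rows in Spark is the SQL `stack` generator, called through `DataFrame.selectExpr`. The sketch below takes that approach. The function names and the `metric`/`value` output column names are hypothetical, and it assumes that all the value columns share one data type, which `stack` requires.

```python
def build_stack_expr(value_cols, key_name="metric", value_name="value"):
    """Build a Spark SQL stack() expression that unpivots value_cols.

    Each column becomes one output row holding its name and its value.
    """
    pairs = ", ".join(f"'{c}', `{c}`" for c in value_cols)
    return f"stack({len(value_cols)}, {pairs}) as ({key_name}, {value_name})"


def transpose_columns_to_rows(df, id_cols, value_cols,
                              key_name="metric", value_name="value"):
    """Return df with value_cols unpivoted into (key_name, value_name) rows.

    id_cols are kept as they are and repeated on every output row.
    """
    kept = [f"`{c}`" for c in id_cols]
    return df.selectExpr(*kept, build_stack_expr(value_cols, key_name, value_name))


# Example call on a Spark dataframe with hypothetical column names:
# long_df = transpose_columns_to_rows(df, ["state"], ["confirmed", "recovered"])
```

The long format that comes out of this (one row per state and metric) is usually easier to pass to a plotting library, because the metric name can be mapped directly to colour.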


Using Spark

Apache Spark can read data in various formats such as CSV, JSON, Parquet, and Avro. Here the data is in JSON format and has to be converted to a Spark dataframe so that Spark can process it in a distributed way. The PySpark library is used to interact with Spark from Python.
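
The conversion step could be sketched roughly as follows, assuming the API payload has already been parsed into a list of dictionaries. The function name `json_records_to_dataframe` and the `statewise` key in the usage comment are assumptions for illustration. The function only relies on `SparkContext.parallelize` and `DataFrameReader.json`, which accepts an RDD of JSON strings.

```python
import json


def json_records_to_dataframe(spark, records):
    """Turn a list of JSON-like dicts into a Spark dataframe.

    Each record is serialised to one JSON string, the strings are
    distributed as an RDD, and Spark infers the schema while reading them.
    """
    json_lines = [json.dumps(record) for record in records]
    rdd = spark.sparkContext.parallelize(json_lines)
    return spark.read.json(rdd)


# Typical use (needs a working Spark installation):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("covid19-india").getOrCreate()
# df = json_records_to_dataframe(spark, payload["statewise"])  # key name is an assumption
# df.printSchema()
```

Since Spark infers the schema from the JSON values, counts that the API delivers as strings would still need to be cast to a numeric type before any arithmetic.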


Data Visualization

Now the dataframe is ready for exploratory data analysis.

Note: The data shown below are as of the second week of August 2020.

Consolidated Analysis

Total cases in India

The data show that only 9% of the people tested are confirmed positive. With the current data, the recovery rate has risen to 73.18 per cent and the fatality rate stands at 1.92 per cent.

Image by author
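
The percentages quoted above are simple ratios. The sketch below shows one way to compute them, assuming that the recovery and fatality rates are taken relative to confirmed cases and the confirmation share relative to the number tested. The function names are hypothetical, and the input numbers are round illustrative values, not the real counts.

```python
def percentage(part, whole):
    """Return part as a percentage of whole, rounded to two decimals."""
    if whole == 0:
        raise ValueError("whole must be non-zero")
    return round(100 * part / whole, 2)


def summarise(tested, confirmed, recovered, deaths):
    """Compute the three headline rates from the raw totals."""
    return {
        "positivity_rate": percentage(confirmed, tested),
        "recovery_rate": percentage(recovered, confirmed),
        "fatality_rate": percentage(deaths, confirmed),
    }


# Illustrative round numbers only, not the actual data
print(summarise(tested=10_000, confirmed=1_000, recovered=700, deaths=20))
# prints {'positivity_rate': 10.0, 'recovery_rate': 70.0, 'fatality_rate': 2.0}
```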

State-wise Recovery and Death rate

The analysis shows that the recovery rate is far higher than the death rate.

Image by author


State-wise total Confirmed, Active, Death report

Among the top 10 states, Maharashtra has more cases than any other state in India.

Image by author

Day-wise Analysis

To analyse daily counts, the data needs to be fetched from a different API.

Daily Confirmed vs Recovered vs Deceased count report

The data make it clear that both the number of people getting infected and the number recovering are increasing day by day.

Image by author

Confirmed cases of Top States

The data clearly show a steady increase in confirmed cases in the main states of India.

Image by author

Monthly cases report of Indian states

The data show a clear increase in total counts from June to July.

Image by author
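
A monthly report like this is a roll-up of the daily counts. The following plain-Python sketch shows the idea under the assumption that each daily record is a `(date, count)` pair. The function name and the sample values are hypothetical. In Spark, the same idea would be a group-by on the month followed by a sum.

```python
from collections import defaultdict
from datetime import date


def monthly_totals(daily_counts):
    """Sum (date, count) pairs into a {(year, month): total} mapping."""
    totals = defaultdict(int)
    for day, count in daily_counts:
        totals[(day.year, day.month)] += count
    return dict(totals)


# Illustrative values only, not the actual data
sample = [
    (date(2020, 6, 29), 5),
    (date(2020, 6, 30), 7),
    (date(2020, 7, 1), 9),
]
print(monthly_totals(sample))
# prints {(2020, 6): 12, (2020, 7): 9}
```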

The first confirmed case in India was reported on January 30, 2020. Taken together, the graphs show a consistent rise in the number of cases within the country since then.