So we promise you, they will never ever, ever, ever turn evil — “The Mitchells vs the machines”

If you think I am letting my imagination run wild about what artificially intelligent devices can do to us, you are wrong.

I recently watched the movie “The Mitchells vs the machines” and was overwhelmed by its portrayal of artificially intelligent robots taking over the world by putting humans into pods. The pods are then to be launched into space forever, leaving the machines to run the world thereafter.

It is neither the happy ending nor the world we…

Responsible AI, Ethical AI, AI for social good — I am sure you must have heard these terms at some point or the other, whether you are a Data Scientist or not.

When I first heard these terms, the warning from Prof Stephen Hawking rang in my ears:

“The development of full artificial intelligence could spell the end of the human race.”

And there my journey of understanding this critical aspect of the AI foundation started. …

If you also carry a vision of ensuring that the product you are working on follows all the written rules of “AI for good”, then you would have definitely encountered a situation where your data is biased.

Biased models, biased data, and biased implementations are typical woes of a Data Scientist’s life. So first we need to understand and acknowledge that bias exists and can take any shape or form.

Yes, bias is a broad term and it can be present in the data collection, algorithm, or even at the ML output interpretation stage.

Principles and benefits of going cloud-native

Cloud-native has changed the dynamics of the software industry and how people think of deploying and operating software applications.

As per **Wikipedia**, cloud-native computing is an approach in software development that utilizes cloud computing to “build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds”.

In short, it lets you build and run the applications based on the cloud computing delivery model.

Let’s develop a little background on what cloud computing is: “It refers to the on-demand availability of resources like cloud storage and computing power without the…

**Time-varying data characteristics**

One of the most critical assumptions in ML data modeling is that the train and test datasets come from similar distributions. This assumption underpins the generalization property of an ML solution.

Based on the generalization property, a machine learning model learns the association between the independent features and the target variable from the train data and predicts the target for unseen data (we will call it test data in the rest of the article).

Note that the train data is used as a reference to estimate the target value for the test data. …
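One rough way to check the "similar distribution" assumption is to compare the empirical distribution of a feature in the train data against the same feature in the test data. The sketch below computes a two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs) using only the standard library. The function name `ks_statistic` and the synthetic Gaussian samples are assumptions made for illustration; they are not taken from the article.

```python
import random
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The maximum gap is always attained at one of the observed points.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


# Synthetic, illustrative data: one test set resembles train, one has drifted.
rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(500)]
test_similar = [rng.gauss(0.0, 1.0) for _ in range(500)]
test_shifted = [rng.gauss(1.5, 1.0) for _ in range(500)]

print("KS (similar test set):", round(ks_statistic(train, test_similar), 3))
print("KS (shifted test set):", round(ks_statistic(train, test_shifted), 3))
```

A value near 0 suggests the two samples look alike, while a value near 1 suggests they barely overlap. What counts as "too large" depends on the sample size and the use case.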

You must be wondering why I am calling this article **Binging with Statistics**. Well, we have all been binging on Netflix, Hotstar, Amazon Prime, and whatnot while struggling through the pandemic.

So, I thought I would do something different: binge on some key statistics terms and share them with you. A productive way to binge, right?

Well, I hope it turns out to be a good binge time for you as well 😊

Here we go:

**Statistical Significance:**

Statistical Significance suggests that the result of an experiment is attributable to a specific cause and not to pure chance…
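To make the idea of "not pure chance" concrete, here is a minimal sketch of an exact permutation test. It asks how often a difference in group means at least as large as the observed one would appear if the group labels were assigned purely at random. The function name `exact_permutation_p_value` and the tiny example groups are illustrative assumptions. Enumerating every split is only practical for very small samples.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_p_value(group_a, group_b):
    """Two-sided p-value for the difference in means, over all label splits."""
    pooled = list(group_a) + list(group_b)
    indices = range(len(pooled))
    observed = abs(mean(group_a) - mean(group_b))

    extreme = total = 0
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Count splits whose gap is at least as large as the observed gap
        # (small tolerance guards against floating-point noise).
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical measurements from a control group and a treatment group.
control = [1, 2, 3, 4]
treatment = [5, 6, 7, 8]
print("p-value:", exact_permutation_p_value(control, treatment))
```

A small p-value means that random relabelling rarely reproduces such a gap, which is the sense in which a result is called statistically significant.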

Sharing my solution to help you kickstart your hackathon journey

I recently participated in MachineHack’s **Buyer’s Time Prediction Challenge** and would like to share my approach with you. So, let's get started with a quick outline:

- Problem Statement
- Data understanding
- Solution: a) Target variable transformation, b) Outlier removal, c) Feature Engineering, and d) Modeling
- Learning from peers

**Problem Statement:**

The competition focused on developing a machine learning model to predict the time buyers spend on an eCommerce platform.

**Evaluation metric:** Root mean squared logarithmic error

**Target variable:** “time_spent”
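For readers unfamiliar with the metric, here is a minimal sketch of root mean squared logarithmic error. It uses the common `log(1 + x)` form, which assumes non-negative targets and predictions. The function name `rmsle` and the sample numbers are illustrative assumptions. This is not necessarily how the competition platform or my solution computes the metric.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Hypothetical "time_spent" values and model predictions.
actual = [30.0, 120.0, 600.0]
predicted = [45.0, 100.0, 900.0]
print("RMSLE:", round(rmsle(actual, predicted), 4))
```

Because the error is measured on a log scale, the metric penalizes relative rather than absolute differences. This is one reason a log transformation of the target is often paired with it.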

Let us quickly **understand the data:**

We have 9 features in total as shown…

Sufficiency, Robustness, and more

In this article, I will talk about the various properties of a statistic.

Please see post1 for an introduction to statistical modeling and post2 to understand the difference between a statistic and an estimator.

Let’s revisit **what is a statistic?**

A statistic is a single value that is calculated from the sample data.

**What is it used for?**

Well, as the definition suggests, it is used to describe a sample of data, in addition to estimating a population parameter or evaluating a hypothesis.
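As a small illustration, each value below is a statistic: a single number computed from the sample alone. The helper name `summarize` and the sample values are assumptions made for this sketch.

```python
from statistics import mean, median, stdev


def summarize(sample):
    """Return a few common statistics computed from the sample data only."""
    return {
        "mean": mean(sample),      # often used to estimate the population mean
        "median": median(sample),  # describes the centre of the sample
        "stdev": stdev(sample),    # sample standard deviation (n - 1 denominator)
    }


# Hypothetical sample of eight observations.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
for name, value in summarize(sample).items():
    print(name, round(value, 3))
```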

**Why do we need to study it?**

We need a statistic of the sample data…

A quick aggregation of data quality checks

Machine Learning practitioners spend much of their time on data, because data quality issues inherently hinder the learning of any algorithm. This is the “Garbage in, garbage out” principle: poor data quality can only result in poor learning. Good-quality data is crucial to the success of any machine learning algorithm.

You may be wondering why I am emphasizing data quality so much, so let us look at the definition of ‘Machine Learning’ and infer its significance:

As per Wikipedia:

*“Machine Learning involves computers learning from data provided so that…*

and the properties of an estimator

We often come across terms like a statistic and an estimator while working on statistical problems in the realm of Machine Learning. It is important to understand the difference between the two so that we know the context behind such terms in parameter learning. But, before understanding the difference between a statistic and an estimator, let’s first understand a few concepts on the properties of a statistical model.

So, let’s get started.

Identifiability: A parameter of a statistical model is said to be identifiable if and only if the following mapping is one-to-one: