# Discrete Probability Distributions in Machine Learning

The probabilities of the outcomes of a discrete random variable can be summarized with a **discrete probability distribution**.

Discrete probability distributions are used in machine learning to model binary and multi-class classification problems, to evaluate the performance of binary classification models (for example, when calculating confidence intervals), and to model the distribution of words in text for natural language processing.

Knowledge of discrete probability distributions is also required when choosing an activation function for the output layer of a deep learning neural network for classification tasks and when selecting an appropriate loss function.

## Tutorial Overview

This tutorial is divided into five parts; they are:

1. Discrete Probability Distributions
2. Bernoulli Distribution
3. Binomial Distribution
4. Multinoulli Distribution
5. Multinomial Distribution

## Discrete Probability Distribution

A random variable is the quantity produced by a random process.

A discrete random variable is a random variable that can take one of a finite (or countably infinite) set of specific outcomes. The two types of discrete random variables most commonly used in machine learning are binary and categorical.

- Binary Random Variable: x in {0, 1}
- Categorical Random Variable: x in {1, 2, …, K}

A binary random variable is a discrete random variable where the finite set of outcomes is in {0, 1}. A categorical random variable is a discrete random variable where the finite set of outcomes is in {1, 2, …, K}, where K is the total number of unique outcomes.

Each outcome or event for a discrete random variable has a probability. The relationship between the events for a discrete random variable and their probabilities is called the _discrete probability distribution_, which is summarized by a probability mass function (PMF).
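To make the idea of a PMF concrete, here is a minimal sketch for a categorical random variable. It assumes a hypothetical fair six-sided die (K = 6); the die and the names `pmf` and `probability_of` are illustrative and do not come from the text above.

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die, i.e. a categorical
# random variable with K = 6 equally likely outcomes.
K = 6
pmf = {outcome: Fraction(1, K) for outcome in range(1, K + 1)}

# A valid PMF assigns a non-negative probability to every outcome,
# and the probabilities sum to exactly 1.
assert all(p >= 0 for p in pmf.values())
assert sum(pmf.values()) == 1


def probability_of(outcome):
    """Return P(x = outcome), or 0 for an outcome outside {1, ..., K}."""
    return pmf.get(outcome, Fraction(0))


print(probability_of(3))  # 1/6
```

Exact fractions are used here only so that the "sums to 1" check is not affected by floating-point rounding; ordinary floats would work as well with a tolerance.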
For outcomes that can be ordered, the probability of an event equal to or less than a given value is defined by the cumulative distribution function, or CDF for short. The inverse of the CDF is called the percent-point function (PPF); it gives the smallest discrete outcome whose cumulative probability is greater than or equal to a given probability.

- PMF: Probability Mass Function returns the probability of a given outcome.
- CDF: Cumulative Distribution Function returns the probability of a value less than or equal to a given outcome.
- PPF: Percent-Point Function returns the smallest discrete value whose cumulative probability is greater than or equal to the given probability.

## Poisson Process

A **Poisson Process** is a model for a series of discrete events where the average time between events is known but the exact timing of events is random. The arrival of an event is independent of the event before (the waiting time between events is memoryless).

The important point is that we know the **average time between events**, but the events are randomly spaced (stochastic). We might see back-to-back events, but we could also go years between events because of the randomness of the process.

A Poisson Process meets the following criteria (in reality, many phenomena do not meet these exactly):

1. Events are independent of each other: the occurrence of one event does not affect the probability that another event will occur.
2. The average rate (events per time period) is constant.
3. Two events cannot occur at the same time.

## Poisson Distribution

The Poisson Process is the model we use for describing randomly occurring events and, by itself, is not very useful. We need the **Poisson Distribution** to do interesting things such as finding the probability of a number of events in a time period or finding the probability of waiting some time until the next event.

The product (events / time) × (time period) is usually written as a single parameter λ, called the rate parameter, which is the expected number of events in the interval.
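The two uses mentioned above can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard Poisson mass function, P(k events) = e^(−λ) · λ^k / k!, which the text does not write out. The function names and the numbers (3 events per hour, a 2-hour interval) are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """P(exactly k events in the interval) for rate parameter lam."""
    if k < 0:
        return 0.0
    return math.exp(-lam) * lam ** k / math.factorial(k)


def prob_wait_longer_than(t, events_per_time):
    """P(no event occurs within time t), i.e. the wait exceeds t."""
    return math.exp(-events_per_time * t)


# Hypothetical numbers: on average 3 events per hour, observed for 2 hours.
events_per_time = 3.0
time_period = 2.0
lam = events_per_time * time_period  # expected number of events in the interval

# The probabilities over k = 0, 1, 2, ... sum to 1 (up to a negligible tail).
total = sum(poisson_pmf(k, lam) for k in range(60))
print(total)

# Waiting longer than the whole interval is the same as seeing zero events in it.
print(poisson_pmf(0, lam), prob_wait_longer_than(time_period, events_per_time))
```

Only the standard library is used here; a statistics package would normally be preferable for large λ or k, where `lam ** k` and the factorial grow very quickly.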
As we change the rate parameter λ, we change the probability of seeing different numbers of events in one interval. Whatever its value, λ is the expected number of events in the interval. When λ is an integer, λ and λ − 1 are tied as the numbers of events with the greatest probability. When λ is not an integer, the most probable number of events is λ rounded down to the nearest integer, since the Poisson distribution is only defined for a discrete number of events. The discrete nature of the Poisson distribution is also why this is a probability _mass_ function and not a _density_ function.

We can use the Poisson Distribution mass function to find the probability of observing a number of events over an interval generated by a Poisson process. Another use of the mass function is to find the probability of waiting some time between events: the probability of waiting longer than a given time equals the probability of observing zero events in that time.

## References

- [Discrete Probability Distributions in Machine Learning](https://towardsdatascience.com/fitting-linear-regression-models-on-counts-based-data-ba1f6c11b6e1)
- [The Poisson Distribution and Poisson Process Explained](https://towardsdatascience.com/the-poisson-distribution-and-poisson-process-explained-4e2cb17d459)
- [Fitting Linear Regression Models on Counts Based Data](https://towardsdatascience.com/fitting-linear-regression-models-on-counts-based-data-ba1f6c11b6e1)
- [Generalized Poisson Regression for Real World Datasets](https://towardsdatascience.com/generalized-poisson-regression-for-real-world-datasets-d1ff32607d79)