Integrals in Probability

Notes on the definite integral and probability modeling, with a mix of formal notation, plotly visuals, and sympy computation.
Author

Peter Amerkhanian

Published

April 14, 2024

Code
library(plotly)
library(dplyr)

x_vec <- seq(0, 25)
y_vec <- seq(0, 25)

vline <- function(x = 0, color = "black") {
  list(
    type = "line",
    y0 = 0,
    y1 = 1,
    yref = "paper",
    x0 = x,
    x1 = x,
    line = list(color = color, dash="dot")
  )
}

The following are notes on using integration for finding cumulative probability distribution functions in the univariate and joint settings. The exercises and citations are all coming out of a textbook for a course in vector calculus ().

Learning Goals

  1. Get comfortable with leveraging u subsitition to compute single and double integrals over general regions, by hand.

  2. Connect the by-hand approach to a more practically useful approach that leverages a computer algebra system. We will cover examples in sympy.

  3. Connect integration and probability models to policy analysis. These methods are of particular use in logistics problems that involve wait times.

Modeling a single waiting time

Consider the following question:

At a drive-thru restaurant, customers spend, on average, 3 minutes placing their orders […] find the probability that customers spend 6 minutes or less placing their order. […] Waiting times are mathematically modeled by exponential density functions.

()

To model this and calculate our target quantity, we’ll start by referring to the process of placing one’s order as a random variable, X. We will model that random variable using an exponential probability density function (pdf): (1)f(x)={1βexβx00x<0

Where β is the average waiting time ().

We set the equation up in R and plot it so that we can understand what it represents.

P_x <- function(x, beta=3) {
  case_when(x < 0 ~ 0,
            x >= 0 ~ (1/beta) * exp(-x/beta)
            )
}
Code
data <- data.frame(x = x_vec, y = P_x(x_vec))
ggplot(data, aes(x = x, y = y)) +
  geom_point(color = "royalblue", size=3.1, alpha=.8) +
  geom_line(color = "royalblue", size=1.2, alpha=.7) +
  labs(x = "x (Minutes)", y = "f(x)", title = "An exponential probability density function (pdf)") +
  theme_minimal()

At each x, this function outputs the probability that it takes x minutes placing an order in the drive-thru. For example, the probability that it takes exactly six minutes to place an order is as follows:

P_x(6)
[1] 0.04511176

That’s what a pdf gives us and that’s cool. However, our question asks for the probability that it takes less than or equal to six minutes to order. That involves adding up all of the probabilities for times [0,6] over this continuous function – sounds like an integral. Indeed, we’ll need to integrate over the following region:

Code
plot_ly(
  x = x_vec,
  y = P_x(x_vec),
  type = "scatter",
  mode="line",
  name = "P(x)") %>% add_trace(
    x = 0:6,
    y = P_x(0:6, 3),
    type = "scatter",
    mode="none",
    fill = "tozeroy",
    name = "P(x) <= 6"
) %>% 
  layout(plot_bgcolor = "#e5ecf6", shapes = list(vline(6))
         )

We can formally define that integral as follows:

P(X6)=06f(x)dx=061βexβdx Now that we’ve defined the problem, we’ll solve it by-hand, then using symbolic computation software.

Solving by hand

After setting the problem up, we can solve it step-by-step using u substitution to compute the integral: P(X6)=061βexβdxSubstituteu=xβAnd substitutedxdudx=1βdx=βduP(X6)=061βeu(βdu)=106eudu=eu|06P(X6)=exβ|06

The last line gives us the following algebra: exβ|06=e63(e03)=1e63 Leading us to our answer:

(-exp(-6/3)) + 1
[1] 0.8646647

Which is the probability that it takes six minutes or less to order in this drive-thru.

Solving with sympy

Now we’ll overview a more practically useful way of solving the integral using python’s sympy library. To switch to python, I’ll first load the reticulate library in R and specify that I want to use my conda install of python, which comes pre-loaded with sympy.

library(reticulate)
use_condaenv('base')

Now I’ll switch to python code and set up some preliminaries – I define a custom render function so that I can output the sympy equations I create in LaTeX:

python
import sympy as sp
from IPython.display import Markdown
render = lambda x: Markdown(sp.latex(x, mode='equation'))

Now I’ll define the variables, the constant e, and in sympy. sympy will automatically apply some simplification to the expression, so don’t worry if it doesn’t look exactly like .

python
# Define the variables
x, beta = sp.symbols('x beta')
# Define the constant, e
e = sp.exp(1)
# Define the equation
pdf = (1/beta) * e**(-x/beta)
# Render the equation
render(pdf)

exββ

In our specific problem, we know that the average time it takes to place an order is 3, so we’ll substitute in β=3.

python
f_x = pdf.subs({beta: 3})
render(f_x)

ex33

Integration using sympy is trivially easy – to get the indefinite integral, we can just call .integrate() and supply the variable of integration.

python
render(f_x.integrate(x)) 

ex3

The problem requires the definite integrate, which is just as straightforward – we supply the variable to integrate over, and the upper and lower limits of integration.

python
render(f_x.integrate( (x, 0, 6) )) 

1e2

We’ll retrieve the decimal output by running .evalf() on the integral.

python
f_x.integrate( (x, 0, 6) ).evalf()
0.864664716763387

Much less painful than u-substitution!

Cumulative Probability

When we derived the indefinite integral, 1βexβdx=exβ, in the process of solving our definite integral problem, we actually derived another useful function – the cumulative distribution function (cdf). The cumulative distribution function describes the probability that x is some number, say a, or less, and it is generally the integral of the probability density function (). In the case of the exponential function, it is as follows:

F(a)=P(xa)=0aP(x)dx=exβ|0a

Or, in code:

cdf_x <- function(x, beta=3){
  -1 * exp(-x/beta)
}

We’ll confirm that F(6)=exβ|06

1 + cdf_x(6)
[1] 0.8646647

Here are the pdf and cdf functions plotted together for the exponential distribution with an average of 3 (β=3). At each point on the orange line (the pdf), you can hover over and find out the probability that the drive-thru takes exactly x minutes, whereas on the blue line (the cdf), you’ll find the probability that the drive-thru takes less than or equal to x minutes.

Code
plot_ly(
  x = x_vec,
  y = 1 + cdf_x(x_vec),
  type = "scatter",
  mode="lines+markers",
  name = "CDF: F(a)") %>% 
  add_trace(
    x = x_vec,
    y = P_x(x_vec),
    type = "scatter",
    mode="lines+markers",
    name = "PDF: P(x)"
)

Modeling multiple waiting times

At Sydney’s Restaurant, customers must wait an average of 15 minutes for a table. From the time they are seated until they have finished their meal requires an additional 40 minutes, on average. What is the probability that a customer spends less than an hour and a half at the diner, assuming that waiting for a table and completing the meal are independent events?

()

Since we are given two random variables and asked to find the probability of an event where something specific happens to each one, we can conceive of this problem using a joint density function. We are given that these two waiting times are independent (not correlated), so we can model joint density as follows:

The variables X and Y are said to be independent random variables if their joint density function is the product of their individual density functions: f(x,y)=f1(x)f2(y)

()

The two functions given in the problem are:

  1. an exponential density function with mean 15 (waiting for the table)

f1(x)={115ex15x00x<0

  1. an exponential density function with mean 40 (eating at the table)

f2(y)={140ey40y00y<0

This gives us the joint density function:

(2)f(x,y)=f1(x)f2(y)=[{140ex40x00x<0]×[{115ey15y00y<0]={1600ex40ey15x,y00x<0ory<0

We can set that up in code as well

pdf_function <- function(x, y) {
  table_wait <- P_x(x, 15)
  eating_time <- P_x(y, 40)
  total_time <- table_wait * eating_time
  return(total_time)
}

This function allows us to find the probability of our table wait + eating time being a specific amount of minutes. For example, what’s the probability that it takes us 15 minutes to secure a table and 40 minutes to eat?

pdf_function(15, 40)
[1] 0.0002255588

This is a really small number, which makes some level of intuitive sense – it seems remote that the whole restaurant experience will take exactly 55 minutes, no more no less.

Now, returning to the question – we want to find the probability that the whole restaurant experience (waiting for a table and eating) takes an hour and a half (90 minutes) or less. Again, we’ll need to add up all of the probabilities from 0 to 90 minutes, implying that we will solve this with an integral. However, we are in a multivariate setting, so this will require multiple integration.

To set up the problem, we’ll plot the joint density function, the xy plane, and this 90 minute constraint.

Code
scene <- list(
  camera = list(eye = list(
    x = 2, y = 1.1, z = 1.2
  )),
  xaxis = list(title = "X"),
  yaxis = list(title = "Y"),
  zaxis = list(title = "P(X,Y)")
)
# Define range for x and y
x_range <- seq(0, 50, length.out = 100)
y_range <- seq(0, 50, length.out = 100)
xy_grid <- expand.grid(x = x_range, y = y_range)

z_values <- pdf_function(xy_grid$x, xy_grid$y)
z_plane <- rep(0, nrow(xy_grid))

# x y constraint
x_range_constraint <- seq(40, 50, length.out = 100)
y_constraint <- 90 - x_range_constraint
xy_constraint_grid <- expand.grid(x = x_range_constraint, y = y_constraint)
z_constraint <- rep(0, nrow(xy_grid))


# Reshape z values to create a matrix for plotting
z_matrix <- matrix(z_values, nrow = length(x_range), ncol = length(y_range))
z_constraint_matrix <- matrix(z_constraint, nrow = length(x_range), ncol = length(y_constraint))
z_plane_matrix <- matrix(z_plane, nrow = length(x_range), ncol = length(y_range))


# Create 3D plot using plot_ly
plot_ly(x = x_range, y = y_range, z = z_plane_matrix, type = "surface") %>% hide_colorbar() %>%
  add_trace(
    x = x_range,
    y = y_range,
    z = z_matrix,
    type = "surface",
    colorscale = "Cividis",
    colorbar = list(title = "P(X,Y)")
  ) %>%
  add_trace(
    x = x_range_constraint,
    y = y_constraint,
    z = 0,
    type = "scatter3d",
    name = "X + Y <= 90",
    mode="lines+markers"
  ) %>% layout(scene = scene)

Let’s first define a region of integration, D. We are given x+y90, and given that we are talking about minutes, which are bounded [0,), we can make that 0x+y90. With some algebraic manipulation, we can define the region as:
D={(x,y)|0x90,0y90x}

Setting that up as a definite integral, we get the following:
P(X+Y90)=P((X,Y)D)=Df(x,y)dA

Given that there is no density for this function when x or y is less than zero, we can ignore the piece wise structure of the joint pdf and set up the integral using only the portion that defines the function on the domain x[0,). This gives us:

=090090x1600ex40ey15dydx=1600090090xex40y15dydx

Solving by hand

Computing this integral is difficult, but we’ll walk through it. We will again use u substitution.

u=x40y15dudy=115dy=15du We can plug that in to finish the first integral:
=1600090090xeu(15)dudx=15600090090xeududx=15600090eu|090xdx=15600090ex40y15|090xdx=15600090ex4090x15ex40015dx=15600090e3x1208(90x)120ex40dx=15600090e3x720+8x120ex40dx=15600090e5x720120ex40dx=15600090ex246ex40dx

That was pretty difficult! Now we can continue on to the second integral and split it into two:

=15600[090ex246dx090ex40dx]

We’ll solve these integrals with u and v substitution, with our terms defined as follows:

u=x246dudx=124dx=24duv=x40dvdx=140dx=40dv
When we plug these terms back in, we can finally compute the definite integral.

=15600[24090eudu+40090evdv]=15600[24eu|090+40ev|090]=15600[24ex246|090+40ex40|090]=15600[24(e90246e0246)+40(e9040e040)]=15600[24(e2.25e6)+40(e2.251)]=15600[33.3139396802807]=0.832848492007017

Thus, the probability that the total experience is 90 minutes or less is about 83%.

Solving with sympy

Computing that double integral by hand provides us with a lot of opportunities to make a small error and break the whole computation. In practice, we can solve problems like this in a much faster and less error-prone manner using symbolic computation software. We’ll again switch to python and use sympy.

Returning to the problem,

P(X+Y90)=1600090090xex40y15dydx
We’ll first define the function that we are integrating over.

python
# Define the variables
x, y, = sp.symbols("x y")
# Define the function
f_x_y = e ** (-x/40 - y/15)
render(f_x_y)

ex40y15

Now, we’ll compute the entire double integral in one line. Note that we chain the .integrate() method, each time over a different variable, to perform the multiple integration.

python
(1/600) * (
  f_x_y
  .integrate( (y, 0, 90-x) )
  .integrate( (x, 0, 90) )
  .evalf()
  )
0.832848492007017

We confirm that this is the same answer we arrived at by hand.

Deriving a generalizable approach

Let’s look at another problem, but now, instead of solving just one specific problem, we’ll derive an equation that can easily solve any question structured like this so that don’t have to do so many integrals…

[…] At a drive-thru restaurant, customers spend, on average, 3 minutes placing their orders and an additional 5 minutes paying for and picking up their meals. Assume that placing the order and paying for/picking up the meal are two independent events X and Y. If the waiting times are modeled by the exponential probability densities. Find P[X+Y6] and interpret the result.

() - Question 119

We’ll again set up the joint density function: f(x,y)=f1(x)f2(y)=[{13ex3x00x<0]×[{15ey5y00y<0]={115ex3ey5x,y00x<0ory<0 and the region of integration appropriate for answering the question:
D={(x,y)|0x6,0y6x} This leaves us with the integral: 1150606xex3y5dydx

However, rather than solve this specifically, lets take the opportunity to derive a more general solution that can give us the target probability for any question phrased like this.

We’ll replace all the specific values:

  • 3, which is the average amount of minutes in the drive-thru, with β1,
  • 5, which is the average amount of minutes spent ordering, with β2, and
  • 6, which is the maximum number of minutes for the whole experience, with c.

That gives us: =1β1β20c0cxexβ1yβ2dydx

Solving by hand

Let’s walk through one more painful double integral by-hand. We’ll start with u substitution:
u=xβ1yβ2dudy=1β2dy=β2du

And plug those in to solve the first definite integral:

=1β10c0cxeududx=1β10cexβ1yβ2|0cxdx=1β10cexβ1cxβ2exβ10β2dx=1β10ceβ2xβ1β2β1(cx)β1β2exβ1dx=1β10ceβ2xβ1c+β1xβ1β2exβ1dx=1β1[0ce(β1β2)xβ1cβ1β2dx0cexβ1dx] Now we’ll use u and v substitution to solve the last definite integral:

u=(β1β2)xβ1cβ1β2dudx=(β1β2)β1β2dx=β1β2(β1β2)duv=xβ1dvdx=1β1dx=β1dv

Plugging those in, we arrive at a general function of β1, β2, and c.

=1β1[0ceuβ1β2(β1β2)du0cev(β1dv)]=1β1[β1β2(β1β2)0ceudu(β1)0cevdv]=1β1[β1β2(β1β2)(e(β1β2)xβ1cβ1β2|0c)+β1(exβ1|0c)]=[β2(β1β2)(e(β1β2)cβ1cβ1β2eβ1cβ1β2)(ecβ1e0β1)]=β2(β1β2)[e(β1β2)cβ1cβ1β2eβ1cβ1β2]ecβ1+1f(β1,β2,c)=β2(β1β2)[ecβ1ecβ2]ecβ1+1

Now we’ll code that equation up in R or the calculator of your choice.

cdf_wait <- function(beta1, beta2, c) {
  (-beta2/(beta1-beta2)) * 
  (exp((-c)/beta1) - exp((-c)/beta2)) -
  exp((-c)/beta1) + 1 
}

Going back to our question – let’s see what’s the probability that the drive-thru experience takes 6 minutes or less if waiting to order takes 3 minutes on average and ordering takes 5 minutes on average. We can very easily calculate it with the flexible function f(β1,β2,c) that we just derived:

cdf_wait(3, 5, 6)
[1] 0.4500174

There’s a ~45% chance that a customer will spend 6 minutes or less in the drive-thru. We can also use our function, f(β1,β2,c), to easily compute and visualize the probability for any total drive-thru time given these two average times.

Code
plot_ly(
  x = x_vec,
  y = cdf_wait(3, 5, x_vec),
  type = "scatter",
  mode="lines+markers",
  name = "CDF: F(a)")

And, going back to the first multiple integration question about Sydney’s restaurant , we can now answer that with the same function.

cdf_wait(15, 40, 90)
[1] 0.8328485

We again can find the cumulative probability for any waiting time given the average time to get a table (15) and eat (40).

Code
plot_ly(
  x = seq(0, 150),
  y = cdf_wait(15, 40, seq(0, 150)),
  type = "scatter",
  mode="lines+markers",
  name = "CDF: F(a)")

Solving with sympy

For the sake of completeness, we’ll derive that same generizable equation using sympy. Returning to the problem,

P(X+Yx)=1β1β20c0cxexβ1yβ2dydx

we’ll define the constants and function in sympy.

python
x, y = sp.symbols('x y')
beta1, beta2, c = sp.symbols('beta1 beta2 c')
f_x_y_general = e**(-x/beta1 - y/beta2)
render(f_x_y_general)

eyβ2xβ1

We compute the double integral, and define our generalizable function as f_b1_b2_c.

python
f_b1_b2_c = ( 1/(beta1 * beta2) ) * (
  f_x_y_general
  .integrate( (y, 0, c - x) )
  .integrate( (x, 0, c) )
)
render(f_b1_b2_c.simplify())

{β1β1ecβ1β2+β2ecβ2β1β2forβ1β21ecβ1cβ1otherwise

Note that for more complex equations like this, sympy’s simplified output might not look like what you’d reach by hand. We can be more confident that this is indeed the same f(β1,β2,c) as the one we computed by hand above by testing it with a few problems and confirming the expected output.

Returning to the drive-thru experience, where waiting to order takes 3 minutes on average and ordering takes 5, we can compute the probability that the whole experience takes 6 minutes or less as follows:

python
f_b1_b2_c.subs({beta1:  3, beta2: 5, c:6}).evalf()
0.450017395074414

This is the same as what we get using the by-hand function.

cdf_wait(3, 5, 6)
[1] 0.4500174

In the case of the first problem, with a restaurant where ordering takes 40 minutes on average and eating takes 15, we can compute the probability that the whole experience takes 90 minutes or less just as easily:

python
f_b1_b2_c.subs({beta1:  40, beta2: 15, c:90}).evalf()
0.832848492007017

Again, let’s check this with the by-hand version.

cdf_wait(15, 40, 90)
[1] 0.8328485

Great!

References

“Cumulative Distribution Function.” 2024. Wikipedia. https://en.wikipedia.org/w/index.php?title=Cumulative_distribution_function&oldid=1215839028.
Strang, Gilbert, and Edwin Herman. 2016. Calculus Volume 3. OpenStax.

Footnotes

  1. This is not strictly necessary – I do it purely so that I can reference e in my code and match the problem notation↩︎

Citation

BibTeX citation:
@online{amerkhanian2024,
  author = {Amerkhanian, Peter},
  title = {Integrals in {Probability}},
  date = {2024-04-14},
  url = {https://peter-amerkhanian.com/posts/multiple-integration/},
  langid = {en}
}
For attribution, please cite this work as:
Amerkhanian, Peter. 2024. “Integrals in Probability.” April 14, 2024. https://peter-amerkhanian.com/posts/multiple-integration/.