
[1]:
from timeit import default_timer as timer
from functools import partial
import yaml
import sys

from estimagic import maximize
from scipy.optimize import root_scalar
from scipy.stats import chi2
import numdifftools as nd
import pandas as pd
import respy as rp
import numpy as np

sys.path.insert(0, "python")
from auxiliary import plot_bootstrap_distribution  # noqa: E402
from auxiliary import plot_computational_budget  # noqa: E402
from auxiliary import plot_smoothing_parameter  # noqa: E402
from auxiliary import plot_score_distribution  # noqa: E402
from auxiliary import plot_score_function  # noqa: E402
from auxiliary import plot_likelihood  # noqa: E402

Maximum likelihood estimation

Introduction

EKW models are calibrated to data on observed individual decisions and experiences under the hypothesis that the individual’s behavior is generated from the solution to the model. The goal is to back out information on reward functions, preference parameters, and transition probabilities. This requires the full parameterization \(\theta\) of the model.

Economists have access to information for \(i = 1, ..., N\) individuals in each time period \(t\). For every observation \((i, t)\) in the data, we observe action \(a_{it}\), reward \(r_{it}\), and a subset \(x_{it}\) of the state \(s_{it}\). Therefore, from an economist’s point of view, we need to distinguish between two types of state variables \(s_{it} = (x_{it}, \epsilon_{it})\). At time \(t\), the economist and individual both observe \(x_{it}\) while \(\epsilon_{it}\) is only observed by the individual. In summary, the data \(\mathcal{D}\) has the following structure:

\begin{align*} \mathcal{D} = \{a_{it}, x_{it}, r_{it}: i = 1, ..., N; t = 1, ..., T_i\}, \end{align*}

where \(T_i\) is the number of periods for which we observe individual \(i\).

Likelihood-based calibration seeks to find the parameterization \(\hat{\theta}\) that maximizes the likelihood function \(\mathcal{L}(\theta\mid\mathcal{D})\), i.e. the probability of observing the given data as a function of \(\theta\). As we only observe a subset \(x_t\) of the state, we can determine the probability \(p_{it}(a_{it}, r_{it} \mid x_{it}, \theta)\) of individual \(i\) at time \(t\) in \(x_{it}\) choosing \(a_{it}\) and receiving \(r_{it}\) given parametric assumptions about the distribution of \(\epsilon_{it}\). The objective function takes the following form:

\begin{align*} \hat{\theta} \equiv \text{argmax}_{\theta \in \Theta} \underbrace{\prod^N_{i=1} \prod^{T_i}_{t=1}\, p_{it}(a_{it}, r_{it} \mid x_{it}, \theta)}_{\mathcal{L}(\theta\mid\mathcal{D})}. \end{align*}
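In practice we work with the logarithm of this objective, which turns the product into a sum of individual log-likelihood contributions and is better behaved numerically:

\begin{align*} \log \mathcal{L}(\theta\mid\mathcal{D}) = \sum^{N}_{i=1} \sum^{T_i}_{t=1} \log p_{it}(a_{it}, r_{it} \mid x_{it}, \theta). \end{align*}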

We will explore the following issues:

  • likelihood function

  • score function and statistic

    • asymptotic distribution

    • linearity

  • confidence intervals

    • Wald

    • likelihood-based

    • bootstrap

  • numerical approximations

    • smoothing of choice probabilities

    • grid search


Let’s get started!

[2]:
with open("../../configurations/robinson/robinson.yaml", "r") as file:
    options_base = yaml.safe_load(file)

params_base = pd.read_csv("../../configurations/robinson/robinson.csv")
params_base.set_index(["category", "name"], inplace=True)

simulate = rp.get_simulate_func(params_base, options_base)
df = simulate(params_base)

Let us briefly inspect the parameterization.

[3]:
params_base
[3]:
value
category name
delta delta 0.950
wage_fishing exp_fishing 0.070
nonpec_fishing constant -0.100
nonpec_hammock constant 1.046
shocks_sdcorr sd_fishing 0.010
sd_hammock 0.010
corr_hammock_fishing 0.000

Several options need to be specified as well.

[4]:
options_base
[4]:
{'estimation_draws': 100,
 'estimation_seed': 100,
 'estimation_tau': 0.001,
 'interpolation_points': -1,
 'n_periods': 5,
 'simulation_agents': 1000,
 'simulation_seed': 132,
 'solution_draws': 100,
 'solution_seed': 456,
 'covariates': {'constant': '1'}}

We can now look at the simulated dataset.

[5]:
df.head()
[5]:
Experience_Fishing Shock_Reward_Fishing Meas_Error_Wage_Fishing Shock_Reward_Hammock Meas_Error_Wage_Hammock Dense_Key Core_Index Choice Wage Discount_Rate ... Nonpecuniary_Reward_Fishing Wage_Fishing Flow_Utility_Fishing Value_Function_Fishing Continuation_Value_Fishing Nonpecuniary_Reward_Hammock Wage_Hammock Flow_Utility_Hammock Value_Function_Hammock Continuation_Value_Hammock
Identifier Period
0 0 0 0.999650 1 0.000410 1 0 0 fishing 0.999650 0.95 ... -0.1 0.999650 0.899650 4.739374 4.041815 1.046 NaN 1.046410 4.732282 3.879866
1 1 1.000743 1 0.015065 1 1 1 fishing 1.073305 0.95 ... -0.1 1.073305 0.973305 4.042189 3.230405 1.046 NaN 1.061065 3.905840 2.994500
2 2 0.996461 1 0.011853 1 2 2 fishing 1.146203 0.95 ... -0.1 1.146203 1.046203 3.224430 2.292871 1.046 NaN 1.057853 3.076295 2.124675
3 3 0.998907 1 -0.007859 1 3 3 fishing 1.232329 0.95 ... -0.1 1.232329 1.132329 2.292998 1.221756 1.046 NaN 1.038141 2.113919 1.132398
4 4 0.989419 1 0.012452 1 4 4 fishing 1.309130 0.95 ... -0.1 1.309130 1.209130 1.209130 0.000000 1.046 NaN 1.058452 1.058452 0.000000

5 rows × 21 columns

Likelihood function

We can now start exploring the likelihood function, which provides an order of preference on \(\theta\). The likelihood function is a measure of the information about the potentially unknown parameters of the model. This information is usually incomplete, and the likelihood function also expresses the degree of incompleteness.

We will work with the sum of the individual log-likelihoods throughout, as the likelihood itself cannot be represented without running into problems of numerical underflow. Note that the criterion function of the respy package returns the average log-likelihood across the sample. Thus, we need to be careful to scale it up when computing some of the test statistics later in the notebook.

We will first trace out the likelihood over reasonable parameter values.

[6]:
params_base["lower"] = [0.948, 0.0695, -0.11, 1.04, 0.0030, 0.005, -0.10]
params_base["upper"] = [0.952, 0.0705, -0.09, 1.05, 0.1000, 0.015, +0.10]

We plot the normalized likelihood, i.e. we divide the likelihood function by its maximum so that it attains a value of one at its peak.

[7]:
crit_func = rp.get_log_like_func(params_base, options_base, df)

rslts = dict()
for index in params_base.index:

    upper, lower = params_base.loc[index][["upper", "lower"]]
    grid = np.linspace(lower, upper, 20)

    fvals = list()
    for value in grid:
        params = params_base.copy()
        params.loc[index, "value"] = value
        # Scale the average log-likelihood up to the sum over the sample.
        fval = options_base["simulation_agents"] * crit_func(params)
        fvals.append(fval)

    rslts[index] = fvals
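On the log scale, dividing by the maximum amounts to subtracting the maximum log-likelihood and exponentiating. As a minimal sketch of this step (the helper plot_likelihood takes care of the normalization internally, so its exact implementation may differ):

# Normalize each profile so that the (level) likelihood peaks at one.
rslts_normalized = {
    index: np.exp(np.array(fvals) - np.max(fvals)) for index, fvals in rslts.items()
}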

Let’s visualize the results.

[8]:
plot_likelihood(rslts, params_base)
../_images/maximum-likelihood_notebook_17_0.svg
../_images/maximum-likelihood_notebook_17_1.svg
../_images/maximum-likelihood_notebook_17_2.svg
../_images/maximum-likelihood_notebook_17_3.svg
../_images/maximum-likelihood_notebook_17_4.svg
../_images/maximum-likelihood_notebook_17_5.svg
../_images/maximum-likelihood_notebook_17_6.svg

Maximum likelihood estimate

So far, we looked at the likelihood function in its entirety. Going forward, we will take a narrower view and just focus on the maximum likelihood estimate. We restrict our attention to the discount factor \(\delta\) and treat it as the only unknown parameter. We will use estimagic for all our estimations.

[9]:
crit_func = rp.get_log_like_func(params_base, options_base, df)

However, we will make our life even easier and fix all parameters but the discount factor \(\delta\).

[10]:
constr_base = [
    {"loc": "shocks_sdcorr", "type": "fixed"},
    {"loc": "wage_fishing", "type": "fixed"},
    {"loc": "nonpec_fishing", "type": "fixed"},
    {"loc": "nonpec_hammock", "type": "fixed"},
]

We will start the estimation with a perturbation of the true value.

[11]:
params_start = params_base.copy()
params_start.loc[("delta", "delta"), "value"] = 0.91

Now we are ready to deal with the selection and specification of the optimization algorithm.

[15]:
algo_options = {"stopping_max_criterion_evaluations ": 100}
algo_name = "nag_pybobyqa"

results = maximize(
    criterion=crit_func,
    params=params_start,
    algorithm=algo_name,
    algo_options=algo_options,
    constraints=constr_base,
)

Let’s look at the results.

[28]:
params_rslt = results["solution_params"]
params_rslt
[28]:
lower lower_bound upper upper_bound value
category name
delta delta 0.9480 -inf 0.9520 inf 0.949629
wage_fishing exp_fishing 0.0695 -inf 0.0705 inf 0.070000
nonpec_fishing constant -0.1100 -inf -0.0900 inf -0.100000
nonpec_hammock constant 1.0400 -inf 1.0500 inf 1.046000
shocks_sdcorr sd_fishing 0.0030 -inf 0.1000 inf 0.010000
sd_hammock 0.0050 -inf 0.0150 inf 0.010000
corr_hammock_fishing -0.1000 -inf 0.1000 inf 0.000000
[25]:
fval = results["solution_criterion"] * options_base["simulation_agents"]
print(f"criterion function at optimum {fval:5.3f}")
criterion function at optimum -10049.384

We set up a wrapper that maps a value of \(\delta\) to the sum of the individual log-likelihoods. This scalar interface is what the numerical tools used below expect.

[29]:
def wrapper_crit_func(crit_func, options_base, params_base, value):

    params = params_base.copy()
    params.loc["delta", "value"] = value

    return options_base["simulation_agents"] * crit_func(params)


p_wrapper_crit_func = partial(wrapper_crit_func, crit_func, options_base, params_base)
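As a quick sanity check, the wrapper can be evaluated at any candidate value of \(\delta\), for instance at the true value used in the simulation:

# Sum of the individual log-likelihoods at the true discount factor.
p_wrapper_crit_func(0.95)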

We store the MLE, as we will use it repeatedly going forward.

[30]:
delta_hat = params_rslt.loc[("delta", "delta"), "value"]

At the maximum, the second derivative of the log-likelihood is negative, and we define the observed Fisher information as follows:

\begin{align*} I(\hat{\theta}) \equiv -\frac{\partial^2 \log L(\hat{\theta})}{\partial \theta^2} \end{align*}

A larger curvature corresponds to a more pronounced peak and thus indicates less uncertainty about \(\theta\).

[31]:
delta_fisher = -nd.Derivative(p_wrapper_crit_func, n=2)([delta_hat])
[32]:
delta_fisher
[32]:
2101705.2240765863
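The inverse of the observed Fisher information approximates the sampling variance of the maximum likelihood estimator, so its square root yields the asymptotic standard error that the Wald interval below relies on. A minimal sketch:

# Asymptotic standard error implied by the observed Fisher information.
se_delta = 1 / np.sqrt(delta_fisher)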

Score function and score statistic

The score function is the first derivative of the log-likelihood:

\begin{align*} S(\theta) \equiv \frac{\partial \log L(\theta)}{\partial \theta} \end{align*}
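At an interior maximum the score evaluated at the MLE vanishes. A quick numerical check, reusing the wrapper from above (a hedged sketch; the result is only approximately zero because the likelihood is simulated and the derivative is computed numerically):

# The score at the MLE should be (numerically) close to zero.
score_at_mle = nd.Derivative(p_wrapper_crit_func, n=1)([delta_hat])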

Distribution

The asymptotic normality of the score statistic is of key importance in deriving the asymptotic normality of the maximum likelihood estimator. Here we simulate \(1,000\) samples of \(10,000\) individuals each and compute the score function at the true parameter values. We had to increase the number of simulated individuals because convergence to the asymptotic distribution was otherwise very slow.
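The plot below is based on precomputed results. A minimal sketch of a single replication along the lines just described (10,000 simulated individuals, score evaluated at the true parameter values); the helper's actual implementation may differ:

def simulated_score(seed):
    # Simulate a fresh sample of 10,000 individuals at the true parameters.
    options = options_base.copy()
    options["simulation_agents"] = 10_000
    options["simulation_seed"] = seed
    df_sim = rp.get_simulate_func(params_base, options)(params_base)

    # Score of the sum of individual log-likelihoods at the true values.
    crit = rp.get_log_like_func(params_base, options, df_sim)
    wrapped = partial(wrapper_crit_func, crit, options, params_base)
    return nd.Derivative(wrapped, n=1)([params_base.loc[("delta", "delta"), "value"]])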

[33]:
plot_score_distribution()
../_images/maximum-likelihood_notebook_41_0.svg

Linearity

We seek linearity of the score function around the true value so that the log-likelihood is reasonably well approximated by a second-order Taylor polynomial:

\begin{align*} \log L(\theta) \approx \log L(\hat{\theta}) + S(\hat{\theta})(\theta - \hat{\theta}) - \tfrac{1}{2} I(\hat{\theta})(\theta - \hat{\theta})^2 \end{align*}

Since \(S(\hat{\theta}) = 0\), we get:

\begin{align*} \log\left(\frac{L(\theta)}{L(\hat{\theta})}\right) \approx - \tfrac{1}{2} I(\hat{\theta})(\theta - \hat{\theta})^2 \end{align*}
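With the Fisher information and the MLE computed above, the right-hand side can be traced out directly over a grid of discount factors and compared with the profiled log-likelihood from earlier (a hedged sketch):

# Quadratic approximation of the shifted log-likelihood around the MLE.
grid_delta = np.linspace(0.948, 0.952, 20)
quad_approx = -0.5 * delta_fisher * (grid_delta - delta_hat) ** 2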

Taking the derivative to work with the score function, the following relationship is approximately true if the usual regularity conditions hold:

\begin{align*} - I^{-1/2}(\hat{\theta}) S(\theta) \approx I^{1/2}(\hat{\theta}) (\theta - \hat{\theta}) \end{align*}
[34]:
num_points, index = 10, ("delta", "delta")

upper, lower = params_base.loc[index, ["upper", "lower"]]
grid = np.linspace(lower, upper, num_points)

fds = np.tile(np.nan, num_points)
for i, point in enumerate(grid):
    fds[i] = nd.Derivative(p_wrapper_crit_func, n=1)([point])

norm_fds = fds * -(1 / np.sqrt(delta_fisher))
norm_grid = (grid - delta_hat) * (np.sqrt(delta_fisher))

In the best case, \(I^{1/2}(\hat{\theta})(\theta - \hat{\theta})\) follows a standard normal distribution, and so it is common practice to evaluate the linearity over the range from \(-2\) to \(2\).

[35]:
plot_score_function(norm_grid, norm_fds)
../_images/maximum-likelihood_notebook_46_0.svg

Alternative shapes are possible.


Confidence intervals

How do we communicate the statistical evidence using the likelihood? Several notions exist that place different demands on the score function. While Wald intervals rely on both asymptotic normality and linearity, likelihood-based intervals only require asymptotic normality. In well-behaved problems, both measures of uncertainty agree.

Wald intervals

[36]:
rslt = list()
rslt.append(delta_hat - 1.96 * 1 / np.sqrt(delta_fisher))
rslt.append(delta_hat + 1.96 * 1 / np.sqrt(delta_fisher))
"{:5.3f} / {:5.3f}".format(*rslt)
[36]:
'0.948 / 0.951'

Likelihood-based intervals
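The interval is based on the asymptotic \(\chi^2_1\) distribution of the likelihood-ratio statistic: it collects all values of \(\theta\) whose log-likelihood lies within half the \(\chi^2\) critical value of the maximum,

\begin{align*} \left\{\theta: \log L(\theta) - \log L(\hat{\theta}) \geq -\tfrac{1}{2}\chi^2_{1, 1-\alpha}\right\}. \end{align*}

The bisection below searches for the two boundary points of this set.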

[37]:
def root_wrapper(delta, options_base, alpha):

    # Log-likelihood ratio cutoff implied by the chi2(1) distribution.
    crit_val = -0.5 * chi2.ppf(1 - alpha, 1)

    params_eval = params_base.copy()
    params_eval.loc[("delta", "delta"), "value"] = delta
    likl_ratio = options_base["simulation_agents"] * (
        crit_func(params_eval) - crit_func(params_base)
    )

    return likl_ratio - crit_val
[38]:
brackets = [[0.75, 0.95], [0.95, 1.10]]

rslt = list()
for bracket in brackets:
    root = root_scalar(
        root_wrapper,
        method="bisect",
        bracket=bracket,
        args=(options_base, 0.05),
    ).root
    rslt.append(root)
print("{:5.3f} / {:5.3f}".format(*rslt))
0.948 / 0.951

Bootstrap

We can now run a simple bootstrap to see how the asymptotic standard errors line up.
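The plotted distribution and the confidence interval further below rely on precomputed results stored under material/. A minimal sketch of a nonparametric bootstrap over individuals, assuming the estimation setup from above (start values, constraints, and optimizer); the stored results may have been generated differently:

def bootstrap_delta(df, num_boots=10, seed=123):
    """Re-estimate delta on samples of individuals drawn with replacement."""
    rng = np.random.default_rng(seed)
    ids = df.index.get_level_values("Identifier").unique()

    estimates = []
    for _ in range(num_boots):
        # Resample individuals with replacement and relabel their identifiers.
        sampled = rng.choice(ids, size=len(ids), replace=True)
        df_boot = pd.concat(
            [df.loc[i] for i in sampled],
            keys=range(len(sampled)),
            names=["Identifier", "Period"],
        )

        boot_crit = rp.get_log_like_func(params_base, options_base, df_boot)
        rslt = maximize(
            criterion=boot_crit,
            params=params_start,
            algorithm="nag_pybobyqa",
            constraints=constr_base,
        )
        estimates.append(rslt["solution_params"].loc[("delta", "delta"), "value"])

    return pd.Series(estimates, name="delta")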


[39]:
plot_bootstrap_distribution()
../_images/maximum-likelihood_notebook_56_0.svg
../_images/maximum-likelihood_notebook_56_1.svg

We can now construct the bootstrap confidence interval.

[40]:
fname = "material/bootstrap.delta_perturb_true.pkl"
boot_params = pd.read_pickle(fname)

rslt = list()
for quantile in [0.025, 0.975]:
    rslt.append(boot_params.loc[("delta", "delta"), :].quantile(quantile))
print("{:5.3f} / {:5.3f}".format(*rslt))
0.948 / 0.951

Numerical aspects

The shape and properties of the likelihood function are determined by several numerical tuning parameters, such as the quality of the numerical integration and the smoothing of the choice probabilities. Ideally, we would simply choose the best setting for each component, but this comes at the cost of an increased time to solution.

[41]:
grid = np.linspace(100, 1000, 100, dtype=int)

rslts = list()
for num_draws in grid:
    options = options_base.copy()

    options["estimation_draws"] = num_draws
    options["solution_draws"] = num_draws

    start = timer()
    rp.get_solve_func(params_base, options)
    finish = timer()

    rslts.append(finish - start)

We are ready to see how time to solution increases as we improve the quality of the numerical integration by increasing the number of Monte Carlo draws.

[42]:
plot_computational_budget(grid, rslts)
../_images/maximum-likelihood_notebook_63_0.svg

We need to learn where to invest a limited computational budget. We focus on the following going forward:

  • smoothing parameter for logit accept-reject simulator

  • grid search across core parameters

Smoothing parameter

We now show the shape of the likelihood function for alternative choices of the smoothing parameter \(\tau\). There exists no closed-form solution for the choice probabilities, so they are simulated. Applying a basic accept-reject (AR) simulator poses two challenges. First, simulated choice probabilities of zero can occur for low-probability events, which causes problems for the evaluation of the log-likelihood. Second, the simulated choice probabilities are not smooth in the parameters but instead form a step function. This is why McFadden (1989) introduced a class of smoothed AR simulators. The logit-smoothed AR simulator is the most popular one and is the one implemented in respy. The implementation requires specifying the smoothing parameter \(\tau\). As \(\tau \rightarrow 0\), the logit smoother approaches the original indicator function.
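To make the role of \(\tau\) concrete, here is a minimal sketch of the logit kernel for a single simulated draw of the shocks (an illustration only, not respy's internal implementation): the indicator for the utility-maximizing alternative is replaced by a smooth function of the alternative-specific values.

def smoothed_choice_probabilities(values, tau):
    """Logit-smoothed choice probabilities for one draw of the shocks."""
    values = np.asarray(values, dtype=float)
    # Subtract the maximum before exponentiating for numerical stability.
    exponentials = np.exp((values - values.max()) / tau)
    return exponentials / exponentials.sum()


# As tau approaches zero, the probabilities approach the argmax indicator.
smoothed_choice_probabilities([4.739, 4.732], tau=0.001)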

[43]:
rslts = dict()

for tau in [0.01, 0.001, 0.0001]:

    index = ("delta", "delta")

    options = options_base.copy()
    options["estimation_tau"] = tau

    crit_func = rp.get_log_like_func(params_base, options, df)
    grid = np.linspace(0.948, 0.952, 20)

    fvals = list()
    for value in grid:
        params = params_base.copy()
        params.loc[index, "value"] = value
        fvals.append(crit_func(params))

    rslts[tau] = fvals - np.max(fvals)

Now we are ready to inspect the shape of the likelihood function.

[44]:
plot_smoothing_parameter(rslts, params_base, grid)
../_images/maximum-likelihood_notebook_69_0.svg