\(
\def\A{{\bf A}}
\def\a{{\bf a}}
\def\B{{\bf B}}
\def\b{{\bf b}}
\def\C{{\bf C}}
\def\c{{\bf c}}
\def\D{{\bf D}}
\def\d{{\bf d}}
\def\E{{\bf E}}
\def\e{{\bf e}}
\def\f{{\bf f}}
\def\F{{\bf F}}
\def\K{{\bf K}}
\def\k{{\bf k}}
\def\L{{\bf L}}
\def\H{{\bf H}}
\def\h{{\bf h}}
\def\G{{\bf G}}
\def\g{{\bf g}}
\def\I{{\bf I}}
\def\R{{\bf R}}
\def\X{{\bf X}}
\def\Y{{\bf Y}}
\def\OO{{\bf O}}
\def\oo{{\bf o}}
\def\P{{\bf P}}
\def\Q{{\bf Q}}
\def\r{{\bf r}}
\def\s{{\bf s}}
\def\S{{\bf S}}
\def\t{{\bf t}}
\def\T{{\bf T}}
\def\x{{\bf x}}
\def\y{{\bf y}}
\def\z{{\bf z}}
\def\Z{{\bf Z}}
\def\M{{\bf M}}
\def\m{{\bf m}}
\def\n{{\bf n}}
\def\U{{\bf U}}
\def\u{{\bf u}}
\def\V{{\bf V}}
\def\v{{\bf v}}
\def\W{{\bf W}}
\def\w{{\bf w}}
\def\0{{\bf 0}}
\def\1{{\bf 1}}
\def\AM{{\mathcal A}}
\def\EM{{\mathcal E}}
\def\FM{{\mathcal F}}
\def\TM{{\mathcal T}}
\def\UM{{\mathcal U}}
\def\XM{{\mathcal X}}
\def\YM{{\mathcal Y}}
\def\NM{{\mathcal N}}
\def\OM{{\mathcal O}}
\def\IM{{\mathcal I}}
\def\GM{{\mathcal G}}
\def\PM{{\mathcal P}}
\def\LM{{\mathcal L}}
\def\MM{{\mathcal M}}
\def\DM{{\mathcal D}}
\def\SM{{\mathcal S}}
\def\RB{{\mathbb R}}
\def\EB{{\mathbb E}}
\def\tx{\tilde{\bf x}}
\def\ty{\tilde{\bf y}}
\def\tz{\tilde{\bf z}}
\def\hd{\hat{d}}
\def\HD{\hat{\bf D}}
\def\hx{\hat{\bf x}}
\def\hR{\hat{R}}
\def\Ome{\mbox{\boldmath$\omega$\unboldmath}}
\def\bet{\mbox{\boldmath$\beta$\unboldmath}}
\def\et{\mbox{\boldmath$\eta$\unboldmath}}
\def\ep{\mbox{\boldmath$\epsilon$\unboldmath}}
\def\ph{\mbox{\boldmath$\phi$\unboldmath}}
\def\Pii{\mbox{\boldmath$\Pi$\unboldmath}}
\def\pii{\mbox{\boldmath$\pi$\unboldmath}}
\def\Ph{\mbox{\boldmath$\Phi$\unboldmath}}
\def\Ps{\mbox{\boldmath$\Psi$\unboldmath}}
\def\pss{\mbox{\boldmath$\psi$\unboldmath}}
\def\tha{\mbox{\boldmath$\theta$\unboldmath}}
\def\Tha{\mbox{\boldmath$\Theta$\unboldmath}}
\def\muu{\mbox{\boldmath$\mu$\unboldmath}}
\def\Si{\mbox{\boldmath$\Sigma$\unboldmath}}
\def\Gam{\mbox{\boldmath$\Gamma$\unboldmath}}
\def\gamm{\mbox{\boldmath$\gamma$\unboldmath}}
\def\Lam{\mbox{\boldmath$\Lambda$\unboldmath}}
\def\De{\mbox{\boldmath$\Delta$\unboldmath}}
\def\vps{\mbox{\boldmath$\varepsilon$\unboldmath}}
\def\Up{\mbox{\boldmath$\Upsilon$\unboldmath}}
\def\Lap{\mbox{\boldmath$\LM$\unboldmath}}
\newcommand{\ti}[1]{\tilde{#1}}
\def\tr{\mathrm{tr}}
\def\etr{\mathrm{etr}}
\def\etal{{\em et al.\/}\,}
\newcommand{\indep}{{\;\bot\!\!\!\!\!\!\bot\;}}
\def\argmax{\mathop{\rm argmax}}
%\newcommand{\argmax}{\mathop{\mathrm{argmax}\nolimits}}
\def\argmin{\mathop{\rm argmin}}
%\newcommand{\argmin}{\mathop{\mathrm{argmin}\nolimits}}
\def\vec{\text{vec}}
\def\cov{\text{cov}}
\def\dg{\text{diag}}
\)

# Bayesian Deep Learning Models and Inference in the Article 'Assessment of Medication Self-Administration Using Artificial Intelligence'

Here I briefly describe some details on the Bayesian Deep Learning models and the inference used in the article 'Assessment of Medication Self-Administration Using Artificial Intelligence'.
## Vanilla Inference

#### Model Formulation

The goal of the inference procedure is to produce frame-level classifications that take into account both the frame-level predictions (from the deep learning component) and the transition probabilities between different classes (estimated from training data).
Let $z_t$ be an integer ($z_t \in H = \{1,\dots,K\}$) indicating the class of the $t$-th frame ($t=1,\dots,T$); this is the target of our inference. Here $z_t = 1$ indicates a non-event frame (i.e., the subject is not actually performing medication administration at frame $t$), and $z_t = h$, where $h = 2, \dots, K$, indicates that the subject is performing step $h-1$.
Given the transition matrix $\A$ estimated from data and the frame-level classification probabilities $q(z_t)$, we want to find the configuration $\{z_t\}_{t=1}^T$ with the highest score (i.e., the most probable configuration). The score for a configuration is defined as:
\begin{align}
f(z_1,\dots,z_T) = \prod_t p(z_t | z_{t-1}) q(z_t), \label{eq:likelihood}
\end{align}
where $p(z_t | z_{t-1}) = \A_{z_{t-1}z_t}$ is the transition probability term with $\A$ as the transition matrix, and $q(z_t)$ is the probability that the deep learning component assigns to class $z_t$ for the $t$-th frame (one entry of its $K$-dimensional output). The most probable configuration therefore takes into account both the frame-level prediction from the deep learning component and the transition probability between different classes.
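As a small illustration of the score, the sketch below evaluates $f(z_1,\dots,z_T)$ for two candidate configurations. It makes several assumptions that are not in the text: classes are 0-indexed (so index `0` plays the role of $z_t = 1$), the first frame uses a uniform initial distribution `pi` because the $t=1$ term is not specified above, and the numbers in `A` and `q` are made up.

```python
def sequence_score(z, A, q, pi=None):
    """Score f(z_1..z_T) = prod_t p(z_t | z_{t-1}) q(z_t) for one configuration z.

    A is a K x K transition matrix (list of rows) and q is a T x K list of
    frame-level probabilities. Classes are 0-indexed in this sketch.
    """
    K = len(A)
    if pi is None:
        pi = [1.0 / K] * K  # assumption: uniform start, not specified in the text
    score = pi[z[0]] * q[0][z[0]]
    for t in range(1, len(z)):
        score *= A[z[t - 1]][z[t]] * q[t][z[t]]
    return score

# Made-up example with K = 2 classes and T = 3 frames.
A = [[0.9, 0.1],
     [0.2, 0.8]]
q = [[0.7, 0.3],
     [0.4, 0.6],
     [0.1, 0.9]]

score_stay = sequence_score([0, 0, 0], A, q)    # about 0.01134
score_switch = sequence_score([0, 1, 1], A, q)  # about 0.01512
```

In this toy case the configuration that switches class scores higher, because the frame-level evidence in the last two frames outweighs the cost of the less likely transition.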
#### Inference Algorithm

We can use a forward-backward algorithm (essentially dynamic programming, in the style of the Viterbi algorithm) to find the configuration with the highest $f(z_1,\dots,z_T)$. Specifically, in the forward pass, we compute:
\begin{align*}
\s_{t,k} &= \max_{k'} \s_{t-1,k'} p(z_t=k | z_{t-1} = k') q(z_t = k), \\
\u_{t,k} &= \argmax_{k'} \s_{t-1,k'} p(z_t=k | z_{t-1} = k') q(z_t = k),
\end{align*}
for each $t$ and $k$, where $\s_{t,k}$ is the score for the most probable configuration up to $z_t$ (i.e., $z_{1:t} = z_1, \dots, z_t$) if $z_t = k$. $\u_{t,k}$ is the corresponding value of $z_{t-1}$ for the most probable $z_{1:t}$.
In the backward pass, we simply compute $\argmax_k \s_{T,k}$ and backtrack through the stored $\u_{t,k}$ to recover the most probable configuration.
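The two passes above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the code used in the article: it works in log space to avoid underflow on long videos, uses 0-indexed classes, and assumes a uniform initial distribution `pi` for the first frame (the text does not specify one). The example matrices are made up.

```python
import math

def _log(x):
    # log that maps zero probability to -inf instead of raising
    return math.log(x) if x > 0 else float("-inf")

def viterbi(A, q, pi=None):
    """Return (best path, log score) maximizing prod_t A[z_{t-1}][z_t] * q[t][z_t]."""
    T, K = len(q), len(A)
    if pi is None:
        pi = [1.0 / K] * K  # assumption: uniform start
    # Forward pass: s[t][k] is the best log score of any path ending in class k at
    # frame t, and u[t][k] is the previous class on that path.
    s = [[_log(pi[k]) + _log(q[0][k]) for k in range(K)]]
    u = [[0] * K]  # u[0] is never used
    for t in range(1, T):
        s_t, u_t = [], []
        for k in range(K):
            cands = [s[t - 1][kp] + _log(A[kp][k]) for kp in range(K)]
            best = max(range(K), key=cands.__getitem__)
            s_t.append(cands[best] + _log(q[t][k]))
            u_t.append(best)
        s.append(s_t)
        u.append(u_t)
    # Backward pass: start from argmax_k s[T-1][k] and follow the back-pointers.
    z = [max(range(K), key=s[-1].__getitem__)]
    for t in range(T - 1, 0, -1):
        z.append(u[t][z[-1]])
    z.reverse()
    return z, s[-1][z[-1]]

A = [[0.9, 0.1],
     [0.2, 0.8]]
q = [[0.7, 0.3],
     [0.4, 0.6],
     [0.1, 0.9]]
path, log_score = viterbi(A, q)
# path == [1, 1, 1], whereas the per-frame argmax of q would give [0, 1, 1]
```

The example shows the intended smoothing effect: the first frame weakly favors class `0`, but the transition matrix makes a single-frame switch unlikely, so the decoded sequence stays in class `1` throughout.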
#### Note

In the context of hidden Markov models (HMMs), the term $q(z_t)$ can be formulated as an emission distribution with the deep learning output as the observation. Specifically, we can use a beta distribution over $c_t=[c_{t,k}]_{k=1}^K$, parameterized by $z_t$ (where $z_t=[z_{t,k}]_{k=1}^K$ is written as a one-hot vector), as follows:
\begin{align*}
q(z_t) &= \prod_{k=1}^K p(c_{t,k} | z_{t,k}) \\
p(c_{t,k} | z_{t,k}) &= c_{t,k}^{\alpha - 1} (1 - c_{t,k})^{\beta - 1} / B(\alpha, \beta),
\end{align*}
where $\alpha = z_{t,k} + 1$ and $\beta = 2 - z_{t,k}$. The normalizing factor
\begin{align*}
B(\alpha, \beta) &= \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)}, \\
&= \frac{\Gamma(z_{t,k} + 1) \Gamma(2 - z_{t,k})}{\Gamma(3)}, \\
&= \frac{\Gamma(1) \Gamma(2)}{\Gamma(3)}, \\
&= \frac{1}{2},
\end{align*}
which is a constant given that $z_{t,k}$ can either be $1$ or $0$. Note that here $c_t=[c_{t,k}]_{k=1}^K$ corresponds to the frame-level prediction produced by the deep learning component.
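As a quick check of the formulas above, substituting the two possible values of $z_{t,k}$ gives the per-class factors explicitly. Here $k^*$ is a symbol introduced only for this illustration; it denotes the class with $z_{t,k^*} = 1$:
\begin{align*}
p(c_{t,k} | z_{t,k} = 1) &= 2 c_{t,k}, \\
p(c_{t,k} | z_{t,k} = 0) &= 2 (1 - c_{t,k}), \\
q(z_t) &= 2^K c_{t,k^*} \prod_{k \neq k^*} (1 - c_{t,k}).
\end{align*}
So $q(z_t)$ is large when the deep learning component assigns a high probability to class $k^*$ and low probabilities to all other classes.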
## Advanced Inference with Stochastic Process

#### Model Formulation

The inference algorithm above has two disadvantages: (1) It cannot correctly capture the transitions between different steps of the medication administration, since there are usually non-event frames (frames where the subject is not performing administration) between steps. (2) It cannot incorporate knowledge about the duration of each step and of the gap between consecutive steps.
To address these issues, we replace the transition probability $p(z_t | z_{t-1})$ in the score function above with a stochastic process that incorporates prior knowledge. Specifically, we have
\begin{align}
f(z_1,\dots,z_T) = \prod_t p(z_t | z_{1:t-1}) q(z_t), \label{eq:likelihood_new}
\end{align}
where $p(z_t = k | z_{1:t-1})$ is defined recursively as follows:
$$p(z_t = k | z_{1:t-1})=
\begin{cases}
\D_{h(z_{1:t-1}),k} \Phi_{h(z_{1:t-1}),k}(g(1, z_{1:t-1}) + 1),& z_{t-1} = 1 \text{ and } k>1, \\
1 - \sum_{j=2}^K \D_{h(z_{1:t-1}),j} \Phi_{h(z_{1:t-1}),j}(g(1, z_{1:t-1}) + 1),& z_{t-1} = 1 \text{ and } k=1,\\
1 - \Psi_{k'}(g(k', z_{1:t-1}) + 1),& z_{t-1} = k' > 1 \text{ and } k = k',\\
\Psi_{k'}(g(k', z_{1:t-1}) + 1),& z_{t-1} = k' > 1 \text{ and } k = 1,\\
0,& z_{t-1} = k' > 1 \text{ and } k \neq 1 \text{ and } k \neq k'
\end{cases}$$
where $h(z_{1:t-1})$ is the label of the last $z_x$ in the sequence $z_{1:t-1}$ such that $z_x > 1$ (the last event step up to the $(t-1)$-th frame). $g(k, z_{1:t-1})$ is the number of consecutive frames (ending in $z_{t-1}$) with the class label $k$. $\D$ is the transition matrix that defines the transitions between different classes.
Note that $\A$ is frame-level while $\D$ is step-level. $\D_{k',k}$ is the probability of changing from class $k'$ to class $k$, where both $k'>1$ and $k>1$. For example, if a subject can perform either step $2$ or step $3$ (equally probable) after performing step $1$, we have $\D_{2,3} = \D_{2,4} = 0.5$.
$\Phi_{k',k}(\Delta t)$, where $k' > 1$ and $k > 1$, is the CDF of the Gaussian distribution that models the duration of the gap between step $(k' - 1)$ and step $(k - 1)$. For example, if the duration of the gap between step 2 and step 3 follows the Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$, $\Phi_{3,4}(\Delta t)$ is the corresponding CDF. The parameters $\mu$ and $\sigma$ can be estimated from training data. Similarly, $\Psi_{k'}(\Delta t)$ is the CDF of a Gaussian distribution that models the duration of step $(k' - 1)$, and the corresponding parameters can be estimated from training data.
This formulation can be seen as a Poisson process where we replace the exponential distribution with a Gaussian distribution, in order to capture both the mean and variance of the duration.
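To make the case analysis concrete, here is a minimal sketch of $p(z_t = k | z_{1:t-1})$ in plain Python, using the 1-indexed class labels of the text. The function and argument names are hypothetical, the dictionaries `D`, `gap`, and `dur` hold made-up parameters, and the label `0` returned when no step has been performed yet is an assumption, since the text leaves that case open.

```python
import math

def gaussian_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def run_length(label, history):
    """g(label, z_{1:t-1}): consecutive frames equal to `label`, ending at z_{t-1}."""
    n = 0
    for z in reversed(history):
        if z != label:
            break
        n += 1
    return n

def last_event(history):
    """h(z_{1:t-1}): label of the most recent event frame (label > 1)."""
    for z in reversed(history):
        if z > 1:
            return z
    return 0  # assumption: 0 marks "no step performed yet"

def transition_prob(k, history, K, D, gap, dur):
    """p(z_t = k | z_{1:t-1}); D and gap are keyed by (h, k), dur by class label."""
    prev = history[-1]
    if prev == 1:
        h = last_event(history)
        dt = run_length(1, history) + 1

        def start_prob(j):
            # D[h, j] * Phi_{h, j}(dt); step pairs missing from D get probability 0
            if (h, j) not in D:
                return 0.0
            mu, sigma = gap[(h, j)]
            return D[(h, j)] * gaussian_cdf(dt, mu, sigma)

        if k > 1:
            return start_prob(k)
        return 1.0 - sum(start_prob(j) for j in range(2, K + 1))
    # The previous frame belongs to step prev - 1: either continue it or end it.
    end_prob = gaussian_cdf(run_length(prev, history) + 1, *dur[prev])
    if k == prev:
        return 1.0 - end_prob
    if k == 1:
        return end_prob
    return 0.0

# Made-up parameters: K = 3 classes, step 1 (class 2) is always followed by step 2 (class 3).
K = 3
D = {(2, 3): 1.0}
gap = {(2, 3): (10.0, 2.0)}               # gap between step 1 and step 2 ~ N(10, 2^2)
dur = {2: (5.0, 1.0), 3: (8.0, 2.0)}      # step durations

in_gap = [2, 2, 2] + [1] * 9              # 9 non-event frames after step 1
in_step = [2] * 4                         # 4 frames into step 1
p_gap = [transition_prob(k, in_gap, K, D, gap, dur) for k in (1, 2, 3)]
p_step = [transition_prob(k, in_step, K, D, gap, dur) for k in (1, 2, 3)]
# p_gap  is [0.5, 0.0, 0.5]; p_step is [0.5, 0.5, 0.0]
```

In both examples the elapsed time equals the assumed mean, so the Gaussian CDF is exactly $0.5$, and the probabilities over $k$ sum to one.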
#### Inference Algorithm

Similar to the previous section, we can use a forward-backward algorithm to *approximately* find the configuration with the highest $f(z_1,\dots,z_T)$. Specifically, in the forward pass, we compute:
\begin{align*}
\s_{t,k} &= \max_{k'} \s_{t-1,k'} p(z_t=k | z_{t-1} = k', \widehat{z}_{1:t-2}) q(z_t = k), \\
\u_{t,k} &= \argmax_{k'} \s_{t-1,k'} p(z_t=k | z_{t-1} = k', \widehat{z}_{1:t-2}) q(z_t = k),
\end{align*}
for each $t$ and $k$, where $\s_{t,k}$ is the score for the most probable configuration up to $z_t$ (i.e., $z_{1:t} = z_1, \dots, z_t$) if $z_t = k$. $\u_{t,k}$ is the corresponding value of $z_{t-1}$ for the most probable $z_{1:t}$. Note that $\widehat{z}_{1:t-2}$ is the approximately optimal configuration corresponding to $z_{t-1} = k'$.
In the backward pass, we simply compute $\argmax_k \s_{T,k}$ and backtrack through the stored $\u_{t,k}$ to recover the (approximately) most probable configuration.
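The approximate procedure can be sketched as follows. This is an illustration under stated assumptions rather than the article's implementation: for every class it stores the best path found so far (playing the role of $\widehat{z}_{1:t-2}$ followed by $z_{t-1} = k'$) instead of back-pointers, which makes the backward pass implicit. The transition model `toy_trans_prob`, the initial distribution `first_prob`, and all numbers are hypothetical; any function of the form `trans_prob(k, history)` could be plugged in.

```python
import math
from itertools import product

def _log(x):
    return math.log(x) if x > 0 else float("-inf")

def gaussian_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def toy_trans_prob(k, history):
    """Hypothetical p(z_t = k | z_{1:t-1}) for K = 2 (1 = non-event, 2 = one step)."""
    if history[-1] == 1:
        return 0.3 if k == 2 else 0.7            # made-up constant start probability
    run = 0
    for z in reversed(history):                  # g(2, z_{1:t-1})
        if z != 2:
            break
        run += 1
    end = gaussian_cdf(run + 1, 3.0, 1.0)        # made-up step duration ~ N(3, 1)
    return end if k == 1 else 1.0 - end

def path_score(path, q, trans_prob, first_prob):
    """Log score of a full configuration (1-indexed labels)."""
    total = _log(first_prob[path[0] - 1]) + _log(q[0][path[0] - 1])
    for t in range(1, len(path)):
        total += _log(trans_prob(path[t], path[:t])) + _log(q[t][path[t] - 1])
    return total

def approx_decode(q, K, trans_prob, first_prob):
    """Keep, for each class, only the best path found so far that ends in it."""
    paths = [[k] for k in range(1, K + 1)]
    scores = [_log(first_prob[k - 1]) + _log(q[0][k - 1]) for k in range(1, K + 1)]
    for t in range(1, len(q)):
        new_paths, new_scores = [], []
        for k in range(1, K + 1):
            # The transition term is evaluated on the stored history of each k'.
            cands = [scores[i] + _log(trans_prob(k, paths[i])) for i in range(K)]
            best = max(range(K), key=cands.__getitem__)
            new_scores.append(cands[best] + _log(q[t][k - 1]))
            new_paths.append(paths[best] + [k])
        paths, scores = new_paths, new_scores
    best = max(range(K), key=scores.__getitem__)
    return paths[best], scores[best]

q = [[0.9, 0.1], [0.3, 0.7], [0.2, 0.8], [0.4, 0.6], [0.8, 0.2], [0.9, 0.1]]
first_prob = [0.5, 0.5]
path, score = approx_decode(q, 2, toy_trans_prob, first_prob)

# Exhaustive search over all 2^6 configurations, feasible only for tiny examples.
exact = max(path_score(list(p), q, toy_trans_prob, first_prob)
            for p in product([1, 2], repeat=len(q)))
```

By construction `score` equals `path_score(path, ...)` and can never exceed `exact`; on longer sequences the gap between the two gives a rough sense of how much the single-history approximation costs.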
#### Note

Since the first-order Markov property no longer holds, the algorithm above is not guaranteed to find the optimal configuration.
## Relevant Work on Bayesian Deep Learning

- Towards Bayesian deep learning: a framework and some existing methods.

Hao Wang, Dit-Yan Yeung.

*IEEE Transactions on Knowledge and Data Engineering (TKDE)*, 28(12):3395-3408, 2016.

[pdf]

- Natural parameter networks: a class of probabilistic neural networks.

Hao Wang, Xingjian Shi, Dit-Yan Yeung.

*Thirtieth Annual Conference on Neural Information Processing Systems (NIPS)*, 2016.

[pdf] [supplementary] [spotlight video] [code and data]

- Collaborative deep learning for recommender systems.

Hao Wang, Naiyan Wang, Dit-Yan Yeung.

*Twenty-First ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)*, 2015.

**Most cited paper among all papers at KDD 2015.**

[pdf] [project page] [code] [data] [MXNet code] [TensorFlow/Keras/Python code] [ipynb] [slides] [slides (long)]

- Collaborative recurrent autoencoder: recommend while learning to fill in the blanks.

Hao Wang, Xingjian Shi, Dit-Yan Yeung.

*Thirtieth Annual Conference on Neural Information Processing Systems (NIPS)*, 2016.

[pdf] [supplementary] [spotlight video] [code and data]

- A survey on Bayesian deep learning.

Hao Wang, Dit-Yan Yeung.

*ACM Computing Surveys (CSUR)*, 53(5), Article 108, 2020.

[pdf] [project page]