ICU patients are vulnerable to medications, especially infusion medications, whose rate and dosage may worsen a patient's condition. A mortality prediction model can monitor a patient's real-time response to drug treatment, evaluate treatment plans to avoid severe situations such as adverse Drug-Drug Interactions (DDI), and facilitate timely intervention and adjustment of the treatment plan. A patient's treatment history usually has a time-sequence structure, often with missing data. The state-of-the-art method for modelling such sequences is the Recurrent Neural Network (RNN). However, a patient's treatment can last a long period of time, and RNNs may not be suitable for modelling long time-sequence data. Therefore, we propose a heterogeneous-medication-event-driven LSTM to predict patient outcomes, combined with Natural Language Processing and the Gaussian Process (GP), which can handle noisy, incomplete, sparse, heterogeneous and unevenly sampled medication records. In our work, we emphasize the semantic meaning of each medication event and the sequence of medication events, while handling missing values with a kernel-based Gaussian process. We compare the performance of LSTM and Phased-LSTM for modelling the outcome of patients' treatment, impute data with a kernel-based Gaussian process, and conduct an empirical study of different data imputation approaches.

The ICU generates a tremendous amount of medical data through the interactions between patients and ICU staff and through continuous physiological measurements of patients. This large amount of medical data provides a great opportunity for machine learning algorithms. A substantial body of existing research has applied machine learning techniques in the medical field, such as the diagnosis procedure [

However, the biggest challenge is utilizing the EHR data, due to properties such as high dimensionality, heterogeneity, missing values, and long temporal dependency. EHR contains diverse medical features from different sources (

One of the most important characteristics of EHR is its time-sequential nature. The recurrent neural network (RNN) is the current state-of-the-art for modelling sequence data because of the memory mechanism inside its cell structure. However, patients' hospitalizations typically span a long period of time, and an RNN may suffer from exploding or vanishing gradients. Hochreiter [

EHR data consists of various medical events, which contain rich latent relationships,

We propose a feature representation learning framework for heterogeneous, irregularly sampled, multi-source time-series data in EHR. The framework builds models based on natural language processing and the Gaussian Process to improve the time-series data.

Through experiments on mortality risk prediction using MIMIC III clinical fluid-related medical events and diagnosis reports, we demonstrate the effectiveness of the proposed framework, its data processing methods, and the proposed networks.

We compared several popular data imputation approaches for the time-series missing-value problem, and show that the Gaussian Process with the squared exponential (SE) covariance kernel performs best.

Recent research has proposed various methods to generate textual health care data representation. One approach is to construct a latent space representation for patients. Using the latent space representation can preserve the patients’ features and model the patients’ condition. Caballero Barajas et al. [

Other studies have used natural language processing techniques to train different word representations to construct latent spaces. Various approaches have been developed recently to generate word embeddings, such as word2Vec [

In our work, we study the effects of a sequence of clinical medication events on patients. These events can be homogeneous or heterogeneous. The word2Vec model can preserve the similarities between homogeneous medication events and capture the differences between heterogeneous medication events.

Clinical data poses the challenges of irregular sampling, high dimensionality, sparsity, and heterogeneous data types. Many methods have been proposed to address these challenges, such as Matrix Factorization [

Besides, the Gaussian Process is also used for handling the missing-value problem, which can be viewed as the prediction of missing values over a set of continuous quantities. Thus, we can train a Gaussian Process regression model on the observed values and output predictions for the missing values. The Gaussian Process is commonly interpreted as a distribution over functions, with inference taking place in the space of functions [

Deep learning has proven to be an effective approach for predicting patient outcomes compared with other machine learning algorithms. A common feed-forward network fails to model data with temporal dependencies because the model needs to carry information from previous time steps into the current computation. Data with such temporal dependency is also called sequence data, and Recurrent Neural Networks are commonly applied to it. However, patients' data commonly has long-term dependency, which implies long sequence lengths. During training, back-propagation through an RNN then requires longer computation, and sometimes an RNN-based deep neural network cannot be trained at all when the data has long temporal dependency. This is mainly due to the vanishing gradient or exploding gradient problem discussed by Sepp Hochreiter [

The dramatic growth of EHR scales up the amount of data, and patients' data has ever longer temporal dependency. The basic LSTM structure does not meet this trend, so recent research has focused on modifying the structure of the RNN/LSTM neuron. Kounik et al. [

We extracted fluid-input-related event records from MIMIC III database [

Our framework pipeline is shown in

The process of preparing medical event representation for each patient is illustrated in

We used the word2Vec model to construct word vector representations. word2Vec is a neural network model that learns vector representations of words [

The process of building our own word2Vec model is very similar to the normal NLP process.

Given a collection of medical events for patient i, each medical event can be represented as a word vector; the word-vector representation and the numerical-feature representation of the patient are then concatenated to form the input tensor.

The Gaussian process imputes the missing values until all values are uniformly distributed over time. Then, the two vectors are concatenated as input to the neural network. The dimension of the word vector may be too high, which increases the number of parameters in our model; a large number of parameters can lead to problems such as over-fitting and long training time. We therefore apply a dimensionality reduction technique to the word vector. We currently use Principal Component Analysis (PCA) to reduce the dimension so that the length of the event vector representation is suitable for model input, while the similarities and dissimilarities between event vectors are still preserved.
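A minimal sketch of this reduction step with scikit-learn, using random vectors in place of the real word vectors (the 128 and 32 dimensions are illustrative, not the paper's actual settings):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for the learned event word vectors: 500 events, 128 dims each.
event_vectors = rng.normal(size=(500, 128))

# Project onto the top 32 principal components; variance structure
# (and hence pairwise similarity) is preserved as well as possible.
pca = PCA(n_components=32)
reduced = pca.fit_transform(event_vectors)
```

In practice the target dimension would be tuned so that the explained variance ratio (`pca.explained_variance_ratio_.sum()`) stays acceptably high.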

The hallmarks of EHR data are the missing values, high dimensionality, irregular sampling and heterogeneous data types. These problems can greatly influence the performance of the prediction model.

The major problem in medical time-series data is sparse data and unbalanced sampling, which limits the application of models to the mortality risk assessment task. Gaussian Process Regression is a machine learning method based on Bayesian theory and statistical learning theory. Its advantage is that high predictive accuracy for data imputation can be achieved with a small number of hyperparameters.

EHR data is a time-series data that commonly has the missing-value and irregular sampling problem over time. We select Gaussian Process to pre-process the EHR data of each patient. Given a training set {

In probability theory, a Gaussian process is a stochastic process whose observations are continuous variables. Here, we assume that all features in the patients' medical records can be treated as continuous, time-indexed random variables. A Gaussian Process is fully specified by its mean and kernel functions.

Here,

As for data set

The

Here, I_n is the identity matrix. We usually preprocess the data so that the mean function is 0. By the definition of the Gaussian Process, the joint distribution of any finite set of random variables also satisfies a Gaussian distribution.

Let T = {t_1, t_2, t_3, …, t_n} be the sequence of time stamps at which the patient's events occur, taken from the patient record, with corresponding observations f(t_i).

The

where, the mean vector

For the choice of the kernel function, we use the squared exponential (SE) covariance function k_SE(t_i, t_j) = σ_f² exp(−(t_i − t_j)² / (2ℓ²)).

Definition 1. A real-valued kernel function

Definition 2. A function

The squared exponential covariance function has only two hyperparameters, namely the signal variance
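The SE covariance can be sketched directly in numpy; `sigma_f` and `length_scale` stand for the two hyperparameters (signal variance and length scale), with illustrative values:

```python
import numpy as np

def k_se(t1, t2, sigma_f=1.0, length_scale=1.0):
    """Squared exponential covariance between two sets of time stamps.

    Returns the |t1| x |t2| covariance matrix
    sigma_f^2 * exp(-(t1 - t2)^2 / (2 * length_scale^2)).
    """
    d = t1[:, None] - t2[None, :]          # pairwise time differences
    return sigma_f**2 * np.exp(-d**2 / (2.0 * length_scale**2))

t = np.array([0.0, 1.0, 2.5])              # toy observation times
K = k_se(t, t)                              # 3x3 covariance matrix
```

Nearby time stamps get covariance close to σ_f², distant ones close to 0, which is what makes the kernel suitable for interpolating irregularly sampled vitals.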

We then find the derivative with respect to each hyperparameter θ_i

where,

Then, suppose we want to impute the missing value f(t_k) at a new time stamp t_k. We compute the covariance vector k_* = [k_SE(t_k, t_1), …, k_SE(t_k, t_n)]^T between t_k and the observed time stamps, together with the variance k_SE(t_k, t_k).

And

Then, the new covariance matrix can be

For each patient, the vector representation of the medical event has the missing-value problem on “

Data imputation is vital to the performance of our pipeline. For example, a missing value at time t_i is likely to be affected by (and similar to) the observed value at time t_j if the two time points are close to each other. However, a potential shortcoming of this method is the heavy computational workload, especially when we build the model with more features: if an input feature dimension suffers from irregular sampling or missing values, we need to build a new Gaussian Process Regressor for it. Finally, each patient is represented as a fixed-length sequence of medical events with imputed data, and this sequence is the input of our model.
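The per-feature GP imputation described above can be sketched with scikit-learn's `GaussianProcessRegressor` and an RBF (squared exponential) kernel; the time stamps and values below are toy data standing in for one irregularly sampled vital-sign channel:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Observed (time, value) pairs for one feature; t = 3 and 4 are missing.
t_obs = np.array([0.0, 1.0, 2.0, 5.0, 6.0])[:, None]
y_obs = np.sin(t_obs).ravel()

# RBF = squared exponential kernel; WhiteKernel models observation noise.
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(t_obs, y_obs)

# Posterior mean at the missing time stamps is used as the imputed value;
# the posterior std quantifies how uncertain each imputation is.
t_missing = np.array([3.0, 4.0])[:, None]
y_imputed, y_std = gp.predict(t_missing, return_std=True)
```

One regressor is fit per feature dimension, which is exactly where the computational cost noted above comes from.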

RNN-based neural networks are currently the state-of-the-art for modelling sequential data. However, a patient's treatment process usually spans a long period of time, which leads to the "vanishing gradient" problem. A variant of the recurrent neural network, the Long Short-Term Memory unit (LSTM), performs better than the plain RNN. The LSTM architecture can be viewed as a gated cell: the cell decides which information will be remembered or forgotten by opening and closing gates, and maintaining this gate switch allows the LSTM to keep learning over long time intervals. There are three gate functions in an LSTM neuron: the input gate, the output gate and the forget gate.
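A single step of the standard LSTM cell can be sketched in numpy to make the three gates concrete; the weight matrices here are random stand-ins, and the stacked layout of `W`, `U`, `b` is one common convention rather than the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b stack the input (i), forget (f),
    output (o) and candidate (g) blocks, in that order."""
    z = W @ x + U @ h_prev + b
    n = h_prev.shape[0]
    i = sigmoid(z[:n])          # input gate: how much new info enters
    f = sigmoid(z[n:2 * n])     # forget gate: how much old state survives
    o = sigmoid(z[2 * n:3 * n]) # output gate: how much state is exposed
    g = np.tanh(z[3 * n:])      # candidate cell update
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
dx, dh = 4, 3                                   # toy input / hidden sizes
W = rng.normal(size=(4 * dh, dx))
U = rng.normal(size=(4 * dh, dh))
b = np.zeros(4 * dh)
h, c = lstm_step(rng.normal(size=dx), np.zeros(dh), np.zeros(dh), W, U, b)
```

The additive update `c = f * c_prev + i * g` is what lets gradients flow over long intervals when the forget gate stays open.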

where,

Increasing long-term dependency drives research on improving the architecture of the LSTM cell. Building on the classic LSTM, in this paper a time gate (phase gate) is designed for each hidden-layer neuron, yielding the Phased-LSTM. The neuron structure can be improved further, for example by adding a filter gate, to improve model performance. So, besides the regular LSTM, we also use the Phased-LSTM [

It adds a new time gate k_t. The updates to the cell state c_t and output h_t can only be made when the gate is open. In this way, the input is periodically sampled, mitigating the problem of overly long input sequences. The opening and closing of the gate is controlled by an independent rhythm represented by three parameters: the period τ, the phase shift s, and r_on, which controls the ratio of the open-phase duration to the entire period.

This time gate allows the Phased-LSTM to handle the irregular sampling problem and also accelerates the training phase. For example, the opening ratio can be large (close to 1) when the number of medication events inside the current interval is high; otherwise, the opening ratio adjusts to a small value (close to 0). The number of medication events inside a time interval determines the value of the opening ratio and how much information is propagated to the output layer and hidden layer of the Phased-LSTM cell.

The updates of c_t and h_t are gated by k_t: the proposed LSTM updates, denoted c̃_t and h̃_t, are blended with the previous states, so that c_t = k_t ⊙ c̃_t + (1 − k_t) ⊙ c_{t−1} and h_t = k_t ⊙ h̃_t + (1 − k_t) ⊙ h_{t−1}.
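The time gate itself, following the piecewise formulation of the original Phased-LSTM paper, can be sketched as follows; `tau`, `s` and `r_on` are the period, phase shift and open-phase ratio, and `alpha` is the small leak rate used during the closed phase (all values here are illustrative):

```python
import numpy as np

def time_gate(t, tau, s, r_on, alpha=1e-3):
    """Phased-LSTM time gate k_t for time stamp(s) t.

    phi is the position within the rhythm's period; the gate ramps up
    during the first half of the open phase, ramps down during the
    second half, and leaks slightly (alpha * phi) while closed.
    """
    phi = np.mod(t - s, tau) / tau
    return np.where(
        phi < 0.5 * r_on, 2.0 * phi / r_on,             # rising open phase
        np.where(phi < r_on, 2.0 - 2.0 * phi / r_on,    # falling open phase
                 alpha * phi))                          # closed (leaky)

k_open = time_gate(1.0, tau=10.0, s=0.0, r_on=0.2)   # mid open phase -> 1
k_closed = time_gate(5.0, tau=10.0, s=0.0, r_on=0.2)  # closed -> near 0
```

Because k_t depends on the actual time stamp t rather than on the step index, irregularly spaced medication events can be fed to the cell directly.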

Then, we define a softmax layer that maps the outputs generated by the LSTM and Phased-LSTM cells into probabilities.
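The softmax mapping can be sketched in a few lines of numpy; the two-element logit vector here is an illustrative stand-in for the cell's final output:

```python
import numpy as np

def softmax(z):
    """Map a logit vector to a probability distribution.

    Subtracting max(z) before exponentiating avoids overflow without
    changing the result.
    """
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([2.0, 0.5]))  # two-class probabilities summing to 1
```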

The experiments in this paper were carried out on mortality risk prediction and clinical infusion drug event risk prediction data sets generated from MIMIC III. We randomly split the whole data set into a 2/3 training set and a 1/3 test set. The MIMIC III database contains a total of 46,520 patients (a large number of hospitalization records) with fluid-related input records and vital sign records [

For fluid-related input records, we extracted the following information: (1) “

For vital sign records, we extracted the most representative measurements as follows: (1) body temperature: abnormal body temperature may be due to fever, hypothermia, or an adverse drug effect; (2) pulse rate: the pulse rate includes the heart rhythm and the strength of the pulse; (3) respiration rate: fever, illness, or other medical conditions may cause an abnormal respiration rate; (4) blood pressure: blood pressure can be categorized into 4 stages (normal, elevated, Stage 1 and Stage 2), which reflect the condition of the heart. For all of the above measurements, the “

We built the same architecture of LSTM and Phased-LSTM model (

The choice of optimizer and loss function is determined by the task of our models and the dataset itself. We compared several loss functions and optimizers; the Cross-Entropy loss and the Adaptive Moment Estimation (Adam) optimizer gave the best results.

The output of the model is a pair of probabilities [p_1, p_2].

We first investigate the impact of using the Gaussian Process for data imputation on model performance, to verify that the Gaussian Process and the Phased-LSTM indeed improve the dataset and the model performance. During evaluation, we compare different data imputation approaches suitable for time-series missing values. We assume the EHR data are missing at random (MAR), usually defined as a missingness pattern that is not a function of a variable's observed values, because the patient's medical readings change stochastically. The first step is to handle heterogeneous data types by combining the patient's numerical information, such as vital signs and drug amounts, with textual information from fluid-related medical events, such as drug names. Then, we compared several data imputation approaches for improving the dataset: mean imputation (baseline), Autoregressive with exogenous inputs (ARX), Autoregressive moving average (ARMA), Autoregressive integrated moving average (ARIMA), and the Gaussian Process (GP). Next, we compare the results obtained by the LSTM and Phased-LSTM models. Finally, we compare the proposed models with other machine learning algorithms. To show that the Phased-LSTM handles long sequence data, we filter out patients with a small number of medical events and construct the training and testing datasets with an appropriate number of patient instances.

We use the ROC curve, precision and recall to evaluate model performance. The experiment results show the comparison between the LSTM and Phased-LSTM models with and without the Gaussian Process. Some experiment results are shown in

| | SOFA | OASIS | LSTM | Phased-LSTM |
|---|---|---|---|---|
| Precision | | | | 0.8287 |
| Precision (with GP) | -- | -- | 0.8563 | 0.8732 |
| Recall | 0.6271 | 0.6407 | 0.7101 | 0.7837 |
| Recall (with GP) | -- | -- | 0.8321 | 0.8567 |

From

The

| | Mean | ARX | ARMA | ARIMA | GP |
|---|---|---|---|---|---|
| Precision | 0.5044 | 0.7749 | 0.8231 | 0.8322 | 0.8732 |
| Recall | 0.5126 | 0.7863 | 0.8039 | 0.8419 | 0.8567 |

Missing values, irregular sampling, heterogeneous data types, high dimensionality and long temporal dependency all contribute to the difficulty of analyzing health care data, especially in the ICU environment. We proposed a data-preprocessing pipeline using a statistical approach and natural language processing techniques. In addition, we used a new LSTM variant, the Phased-LSTM, to handle the irregular sampling and long temporal dependency in the data. The experiments show that the Phased-LSTM framework with the proposed preprocessing pipeline indeed gives promising results on the mortality prediction task. We also empirically compared different data imputation methods for improving the EHR time-series dataset. Our future work plans to apply our pipeline and model to more complex data, going beyond fluid-related medical events by adding more vital sign data, other medication events and important device management data. We hope the model can then predict risks more accurately, evaluate clinical medication events, and automate the management of important equipment.

Our future work will focus on improving the prediction accuracy of our approach in a real ICU environment by trying different prediction networks and data imputation approaches. In addition, we will also try the neural network based approach for time series imputation such as Bidirectional Recurrent network and End-to-End Generative Adversarial Network (E2GAN).