Date: 2021-02-16

Reinforcement Learning (RL), equipped with large-scale datasets, can provide powerful data-driven support for a wide range of decision-making problems in healthcare, education, business, and more. Classical RL methods focus on the mean of the total return and may therefore give misleading results for the heterogeneous populations found in large-scale datasets.

We introduce the K-heterogeneous MDP to characterize sequential decision problems with a multi-modal return distribution and propose the Auto-Clustered RL algorithm, which automatically detects and identifies homogeneous sub-populations while learning the value functions and the optimal policy for each sub-population. We establish convergence rates and construct confidence intervals for the estimators obtained by the Auto-Clustered RL algorithm. Simulations support our theoretical findings. An empirical study on the widely used MIMIC-III dataset shows evidence of value heterogeneity and confirms the advantage of our new method.