In this paper, we propose the novel model of combinatorial pure exploration with partial linear feedback (CPE-PL). In CPE-PL, given a combinatorial action space $\mathcal{X} \subseteq \{0,1\}^d$, in each round a learner chooses one action $x \in \mathcal{X}$ to play, obtains a random (possibly nonlinear) reward related to $x$ and an unknown latent vector $\theta \in \mathbb{R}^d$, and observes a partial linear feedback $M_{x} (\theta + \eta)$, where $\eta$ is a zero-mean noise vector and $M_x$ is a transformation matrix for $x$. The objective is to identify the optimal action with the maximum expected reward using as few rounds as possible. We also study the important subproblem of CPE-PL, i.e., combinatorial pure exploration with full-bandit feedback (CPE-BL), in which the learner observes full-bandit feedback (i.e. $M_x = x^{\top}$) and gains linear expected reward $x^{\top} \theta$ after each play. In this paper, we first propose a polynomial-time algorithmic framework for the general CPE-PL problem with novel sample complexity analysis. Then, we propose an adaptive algorithm dedicated to the subproblem CPE-BL with better sample complexity. Our work provides a novel polynomial-time solution to simultaneously address limited feedback, general reward function and combinatorial action space including matroids, matchings, and $s$-$t$ paths.

提出了多项式时间适应性算法和多项式时间算法，以针对全带回馈和非线性奖励函数等多种情况进行组合纯探索问题的处理，对样本复杂度进行了分析。

全赌臂或部分线性反馈下的组合纯探索