I was looking at: https://arxiv.org/abs/2506.18254 but your approach is even more general.
EDIT: by "shared" I only mean the adjacency to LLMs/AI/ML. RL is a pretty big differentiator though, and the project looks great.
We could measure order bias really easily though; we just need to look at the average score by rollout position across many runs. I'll add that to my list of experiments!
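Roughly what I have in mind, as a minimal sketch: it assumes each run is logged as a list of scores in rollout order. The function name `mean_score_by_position` and the example numbers are made up for illustration, not taken from the project.

```python
from statistics import mean


def mean_score_by_position(runs):
    """Average score at each rollout position across runs.

    `runs` is assumed to be a list of runs, where each run is a list of
    scores ordered by rollout position (index 0 = first rollout shown).
    Runs may have different lengths; each position is averaged over the
    runs that actually reach it.
    """
    by_position = {}
    for scores in runs:
        for position, score in enumerate(scores):
            by_position.setdefault(position, []).append(score)
    return [mean(by_position[p]) for p in sorted(by_position)]


# Made-up scores, purely for illustration.
example_runs = [
    [1.0, 0.5, 0.0],
    [0.0, 0.5, 1.0],
    [1.0, 0.5, 0.5],
]
position_means = mean_score_by_position(example_runs)
print(position_means)
```

If the per-position means come out roughly flat, that suggests little order bias. A consistent trend across positions would be worth digging into, ideally with enough runs to put error bars on each position.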