is there any open-source benchmark workbench for Claude Code skills?
i want to define multiple versions of a skill, acceptance criteria (to be evaluated by a separate agent), and a runner that does repeated runs of each version to see if any version shows a statistically significant improvement
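
for concreteness, a minimal sketch of the shape such a runner could take. everything here is hypothetical: `benchmark`, `two_proportion_p_value`, and the per-version `trial` callables are made-up names, and each trial is assumed to run one skill version once and return the judge agent's pass/fail verdict against the acceptance criteria. no real Claude Code API is called; the trials are stubbed with simulated outcomes. the significance check is a plain pooled two-proportion z-test, which is only one of several reasonable choices.

```python
import math
import random
from typing import Callable, Dict, List


def benchmark(versions: Dict[str, Callable[[], bool]], runs: int) -> Dict[str, List[bool]]:
    """Run every skill version `runs` times and collect pass/fail verdicts.

    Each callable is a hypothetical stand-in for "run the skill once, then
    have a separate judge agent check the acceptance criteria".
    """
    return {name: [trial() for _ in range(runs)] for name, trial in versions.items()}


def two_proportion_p_value(a: List[bool], b: List[bool]) -> float:
    """Two-sided p-value for the difference in pass rates (pooled z-test)."""
    n_a, n_b = len(a), len(b)
    p_a, p_b = sum(a) / n_a, sum(b) / n_b
    pooled = (sum(a) + sum(b)) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both versions always pass or always fail: no detectable difference.
        return 1.0
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    rng = random.Random(0)
    # Simulated trials with assumed pass rates; replace with real runs.
    versions = {
        "v1": lambda: rng.random() < 0.60,
        "v2": lambda: rng.random() < 0.75,
    }
    results = benchmark(versions, runs=200)
    for name, outcomes in results.items():
        print(name, "pass rate:", sum(outcomes) / len(outcomes))
    print("p-value v1 vs v2:", two_proportion_p_value(results["v1"], results["v2"]))
```

with small run counts the normal approximation is rough, so an exact test (e.g. Fisher's) may be a better fit, and comparing more than two versions would need a multiple-comparison correction.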