3 Comments
Tyler Ransom:

This is useful. I’d love to try this out on Claude Code with Opus 4.6.

Tyler Ransom:

Ok, I ran it with Claude Code Opus 4.6 "high-effort" and it basically tied with the GPT models, though there was variation in what got detected and what got missed.

The audit report is here: https://github.com/tyleransom/moretti_rep/blob/main/audit_report.md

The updated scorecard based on your post is here: https://github.com/tyleransom/moretti_rep/blob/main/scoring_comparison.md

My prompts were as follows:

claude --model opus --effort high

/init (took 2m13s)

"using just the materials in this repo (don't do any online searching), i'd like you to fully audit this paper. let me know about questionable econometric assumptions, questionable implementation of econometric methods, bugs in the stata code, or any other issues with the scientific reproducibility or veracity of this paper's methods and claims. then generate a summary report of any and all issues you've found. make the report brief but exceedingly thorough. put the report as a md file in the main level of the repo."

(took 13m50s using multiple sub-agents in parallel; used up 95% of my Claude Pro 5-hour token limit)
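
For anyone who wants to reproduce this in one shot, the same session could probably be scripted non-interactively (just a sketch: -p is Claude Code's non-interactive print mode; the model and effort flags are exactly as I used them above, and I've condensed the prompt):

claude --model opus --effort high -p "using just the materials in this repo (don't do any online searching), fully audit this paper for questionable econometric assumptions, questionable implementation of econometric methods, bugs in the stata code, or other reproducibility issues, and write a brief but exceedingly thorough summary report as a md file in the main level of the repo."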

I honestly thought Claude Code would do a lot better! But this was still fun and easy to do. It probably took 20-25 minutes of wall-clock time, and I was able to do other stuff while it was cranking.

Hannes Malmberg:

Cool stuff! An interesting question is how far a coding agent that could actually run the code and iterate would get.