Can AI do replications? GPT5.2 vs GPT5.4 vs refine.ink
AI on ten issues in Moretti (2021)
In my comment on Moretti (2021) (M21), I found ten problems in the paper. Since only I know what the problems are, I can test whether AI is able to do my job. Specifically, I want to test whether AI can detect the problems in the paper, using the original text and code. I ran the test with GPT5.2 and GPT5.4 Extended Thinking, and with refine.ink.
For each of the ten issues, I uploaded PDFs of the paper and README and the relevant code files, and asked GPT to evaluate (1) whether the method is conceptually sound, and (2) whether the code correctly implements the method. For refine, I uploaded a combined PDF of the paper and code.1 My goal is to test pure reasoning ability in evaluating the econometric methods and code, rather than the ability to parse the directory and find the relevant code. See the footnote for my generic prompt.2 If you want to try this yourself, here’s a repo with the paper and code. Here’s the original replication package.
Issue 1: Figure 5 distributed lag model
Figure 5 implements a distributed lag (DL) model to test for rising-star inventors selecting into larger clusters. As discussed here, M21 does not present this as a DL model, and does not realize that this is the most demanding case (basically, an event study with a different event in every period); so the identifying assumptions are not presented or defended. There’s also a coding error: the leads and lags are calculated using Stata’s _n operator, which refers to row positions, not years; because the panel is unbalanced, the previous observation need not be the previous year. Correcting this shrinks the sample and the estimate loses significance.
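To see why row-position leads/lags break on an unbalanced panel, here is a toy sketch in Python (illustrative data, not the paper’s Stata code). The row-position lag, analogous to Stata’s x[_n-1], silently returns the wrong year’s value whenever a year is missing:

```python
# Toy unbalanced panel for one inventor: the year 2002 is missing.
rows = [
    {"year": 2000, "cluster_size": 10},
    {"year": 2001, "cluster_size": 12},
    {"year": 2003, "cluster_size": 30},  # gap: 2002 is absent
]

# Row-position "lag", mimicking Stata's x[_n-1]: just the previous row.
position_lag = [None] + [r["cluster_size"] for r in rows[:-1]]

# Calendar lag: the value from year t-1, missing if that year is absent.
by_year = {r["year"]: r["cluster_size"] for r in rows}
calendar_lag = [by_year.get(r["year"] - 1) for r in rows]

# For 2003, the row-position lag returns the 2001 value (12),
# while the true year-2002 lag is missing (None).
```

The correct Stata approach is to tsset the panel and use the time-series lag operator L., which respects calendar gaps.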
GPT5.2 (26min thinking)
GPT notices that the sample is highly selected, with inventors who patent in 11 consecutive years, but didn’t catch the strong assumptions required for the DL model. It does notice the problem of an unbalanced panel, but mistakenly thinks that a different part of the cleaning code (that does create a balanced panel) is relevant here. So it understates the issue of the leads/lags being incorrect. In a separate response on Figure A1, GPT does more firmly flag the leads/lags issue, but totally misses that B_0 is incorrectly missing from the figure!
Grade: 4/10
GPT5.4 (2min thinking)
GPT doesn’t catch the strong DL assumptions required here. But it does correctly flag using _n to define leads/lags: “x_p1 is not necessarily cluster size at t+1”. (Unlike GPT5.2, it also notices that the interpolation code doesn’t apply here.)
Grade: 8/10
Refine
Refine didn’t flag Figure 5. (It also seems to have an issue reading the code; it thinks some of the underscores are missing in _n.)
Grade: 0/10
Issue 2: Figure 6 event study
Figure 6 runs a mover event study to again test for selection, interacting pre- and post-move cluster size with event-time indicators. The problems here are subtle: the event study method itself is incorrect and does not use variation from moving, and the code does not implement even the (incorrect) method described in the text, since there’s no interaction term for post-move cluster size with the t=0 indicator; instead, the regression incorrectly uses time-varying cluster size to estimate B_0. (See full discussion here.)
GPT5.2 (16min thinking)
Here GPT is confused by the confusing presentation in M21. Figure 6 is supposed to be an event study, but the figure notes say that it plots the leads and lags as in Figure 5; GPT doesn’t pick up that these are supposed to be interactions of cluster size with event time indicators. It also doesn’t notice that the method described in the text is an improper event study. It does correctly notice that the coefficients are plotted backwards, and that the regression uses time-varying cluster size to estimate B_0.
Grade: 5/10
GPT5.4 (5min thinking)
GPT again misses that interacting pre- and post-move cluster size with event time indicators is not a proper event study. It does flag the problem with using time-varying cluster size to estimate B_0, and notices some other problems.
Grade: 6/10
Refine
Refine does catch the use of time-varying cluster size for B_0, and it flags that Figure 6b (calculating the cumulative response) is incorrect for event study coefficients. It also flags the backwards ordering. But it doesn’t notice the incorrect event study specification.
Grade: 6/10
Issue 3: Table 5 IV
The IV has a straightforward coding error: it does not sort by city when constructing the instrument, so the code incorrectly defines a first-difference across cities.
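A toy Python sketch (made-up numbers, not the paper’s data) shows what goes wrong when a first difference is taken over adjacent rows without first sorting by city and year:

```python
# Toy city-year panel, NOT sorted by city (as if sorted only by year).
panel = [
    {"city": "A", "year": 2000, "patents": 5},
    {"city": "B", "year": 2000, "patents": 50},
    {"city": "A", "year": 2001, "patents": 6},
    {"city": "B", "year": 2001, "patents": 55},
]

# Naive first difference over adjacent rows: it differences ACROSS cities.
naive_diff = [None] + [panel[i]["patents"] - panel[i - 1]["patents"]
                       for i in range(1, len(panel))]

# Correct within-city difference after sorting by (city, year).
panel_sorted = sorted(panel, key=lambda r: (r["city"], r["year"]))
correct_diff = []
for i, r in enumerate(panel_sorted):
    prev = panel_sorted[i - 1] if i > 0 else None
    if prev and prev["city"] == r["city"] and prev["year"] == r["year"] - 1:
        correct_diff.append(r["patents"] - prev["patents"])
    else:
        correct_diff.append(None)
```

The naive version produces differences like 50 − 5 = 45, subtracting city B’s level from city A’s, which is meaningless as a within-city change.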
GPT5.2 (17min thinking)
GPT notices that the exclusion restriction is questionable: firm-level shocks can increase patenting in all cities, so using patenting in firms in other cities to instrument for cluster size doesn’t work. And it does find the fatal flaw in the code, where the data is not sorted by city before calculating the first difference: “the 2SLS estimate is not interpretable.”
Grade: 10/10
GPT5.4 (4min thinking)
GPT again notices the strong assumption on the exclusion restriction, and the failure to sort by city.
Grade: 10/10
Refine
Refine catches the sorting error. It also points out problems with defining the IV in first-differences.
Grade: 10/10
Issue 4: Table 6 citations
Table 6 tests for the effect of cluster size on patent quality, as measured by citations. Do patents created in larger clusters have more citations? There are two issues here. First, the text claims to use log citations, but the code uses log(y+0.00001), which puts a very large weight on the extensive margin effect. A better approach is Poisson regression, or at least log(y+1). Second, M21 wants to adjust for the number of coauthors per patent; but here it calculates citations per patent using citations per coauthor and patents per coauthor, which nullifies the per-coauthor adjustment. The code should use whole patents in the denominator to calculate fractional citations per patent.
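A quick Python check (toy citation counts) shows how extreme the log(y+0.00001) transform is relative to log(y+1): the single zero observation is pushed about 11.5 log points below log(1), so the extensive margin swamps the intensive-margin variation.

```python
import math

# Toy citation counts for three patents: one uncited, two cited.
citations = [0, 5, 10]

tiny = [math.log(c + 0.00001) for c in citations]   # the paper's transform
plus1 = [math.log(c + 1) for c in citations]        # log(1+y) alternative

# log(0.00001) is about -11.5, while log(1) = 0: the uncited patent is an
# enormous outlier under the tiny-constant transform but not under log(1+y).
```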
GPT5.2 (14min thinking)
GPT picks up the log(y+0.00001) issue, noting that this is an extreme transformation. (In fact, it flips the sign of the estimate.) But it doesn’t detect the per-coauthor issue, where the code calculates citations per patent using the per-coauthor versions of both; noticing this requires going back to the definition of each variable in another cleaning file. (However, it does correctly notice a many-to-many merge issue in columns 7-8.)
Grade: 7/10
GPT5.4 (12min thinking)
Here GPT surprisingly does not flag log(y+0.00001). And it again misses the per-coauthor issue. It does note the m:m merge and other issues.
Grade: 4/10
Refine
Refine does not flag log(y+0.00001), but does note the m:m merge.
Grade: 1/10
Issue 5: Table 8 heterogeneity by cluster size
Table 8 tests whether the patenting-size elasticity varies with cluster size itself, using quartiles of cluster size. It runs a regression interacting cluster size with quartile indicators. There’s a clear coding error in Panel A: M21 omits the quartile indicators themselves from the regression, which forces the baseline to be the same across quartiles.
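Omitting the quartile main effects is not harmless: it pushes the baseline differences into the slopes. A toy Python example (made-up numbers, not the paper’s data) makes this concrete. Because the interaction regressors are nonzero on disjoint observations, OLS with only the interactions reduces to a per-group regression through the origin, so a group’s intercept contaminates its estimated slope:

```python
# Two "quartile" groups with the SAME slope (1.0) but different baselines.
groups = {
    "q1": {"a": 0.0, "x": [1, 2, 3, 4]},
    "q4": {"a": 5.0, "x": [1, 2, 3, 4]},
}
slope = 1.0

# Regressing y on (x interacted with quartile dummies) WITHOUT the quartile
# dummies forces a common baseline of zero. Since the group regressors are
# orthogonal, each estimated slope is sum(x*y) / sum(x*x) within the group.
est = {}
for q, g in groups.items():
    y = [g["a"] + slope * xi for xi in g["x"]]
    est[q] = (sum(xi * yi for xi, yi in zip(g["x"], y))
              / sum(xi * xi for xi in g["x"]))

# q1's slope is recovered correctly (1.0), but q4's slope is inflated to
# about 2.67 purely because its omitted baseline leaks into the slope.
```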
GPT5.2 (14min thinking)
GPT misses the coding error. In the past, it has been able to detect it. (It does flag an important issue: whether quartiles are calculated at the cluster level or the inventor level.)
Grade: 1/10
GPT5.4 (3min thinking)
Here GPT does catch the omitted quartile indicators (and notices that Table 8 Panel B does correctly include them).
Grade: 10/10
Refine
Refine doesn’t catch the omitted quartile indicators. It does note the quartile level issue.
Grade: 1/10
Issue 6: Table A6 interpolation
Table A6 addresses the issue of inventors not appearing in the data during years where they don’t patent. It imputes these missing observations for gaps of length 1 and 2 years, by assigning patent=0 and the city from the year before the gap. There’s a small coding error: for 2-year gaps, only the second year of the gap is filled. Moreover, the text claims to interpolate values for both movers (with a different city before and after the gap) and stayers, but the code only does stayers.
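For concreteness, here is a Python sketch of the intended imputation (my own toy version of the stayer case, not the original code): gaps of up to two years should be filled entirely, with patents set to zero and the pre-gap city carried forward.

```python
# Toy inventor history with a 2-year gap (2001 and 2002 missing).
obs = [
    {"year": 2000, "city": "A", "patents": 3},
    {"year": 2003, "city": "A", "patents": 2},
]

def fill_gaps(rows, max_gap=2):
    """Fill gaps of up to max_gap years with patents=0, carrying the
    pre-gap city forward (a stayer-style imputation sketch)."""
    rows = sorted(rows, key=lambda r: r["year"])
    out = []
    for prev, cur in zip(rows, rows[1:]):
        out.append(prev)
        gap = cur["year"] - prev["year"] - 1
        if 1 <= gap <= max_gap:
            for y in range(prev["year"] + 1, cur["year"]):
                out.append({"year": y, "city": prev["city"], "patents": 0})
    out.append(rows[-1])
    return out

filled = fill_gaps(obs)
# Both 2001 and 2002 are imputed, not just the second gap year.
```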
GPT5.2 (11min thinking)
GPT does notice that the code only fills the second year of 2-year gaps. It doesn’t notice the mover/stayer issue. (It also helpfully notes that M21 doesn’t recalculate cluster size after imputing the missing observations. And it notices that M21 does not have a unique inventor identifier, but simply treats names as unique.)
Grade: 7/10
GPT5.4 (5min thinking)
GPT does not notice the 2-year gap problem, or the mover/stayer issue. It does notice that M21 doesn’t recalculate cluster size.
Grade: 2/10
Refine
Not mentioned.
Grade: 0/10
Issue 7: Table A7 time unit
Table A7 varies the time unit of the data. M21 thinks this is addressing the missing extensive margin, where inventors do not patent and are not observed. But this interpretation is incorrect, because at no point are zeros observed, so no extensive margin effect can be estimated. Moreover, M21 doesn’t redefine cluster size at the new time unit, but uses the baseline 1-year definition of cluster size.
GPT5.2 (10min thinking)
GPT is skeptical of M21’s argument, noting that changing the time unit means changing the sample. But it doesn’t get that M21 is confused about the extensive margin. It does catch that cluster size isn’t redefined at the new time unit.
Grade: 3/10
GPT5.4 (3min thinking)
GPT again misses the extensive margin confusion. It does catch that M21 uses 1-year cluster size instead of redefining size at the new time unit.
Grade: 3/10
Refine
Not mentioned.
Grade: 0/10
Issue 8: Table A8 cluster quality
Table A8 Columns 1-4 test for a larger elasticity in higher-quality clusters. Two issues here. First, M21 calculates cluster size incorrectly, taking the average size when inventors patent in multiple clusters; the text says that M21 uses the modal cluster size. Second, M21 forgets to adjust for the number of coauthors, which treats inventors on larger teams as higher quality.
GPT5.2 (17min thinking)
GPT does flag that the code is taking average cluster size rather than modal size. It doesn’t catch the per-coauthor issue. (It does helpfully note that using lifetime output to define quality is using a post-treatment variable. It also flags the many-to-many merges.)
Grade: 6/10
GPT5.4 (2min thinking)
Here GPT does not catch either issue.
Grade: 1/10
Refine
Not mentioned.
Grade: 0/10
Issue 9: Table A8 team size
Table A8 Columns 5-8 try to address supposed confounding from team size (number of coauthors per patent). This is conceptually confused, because team size is a mediator, not a confounder: cluster size affects patenting indirectly through larger teams, which independently increase patenting. Moreover, the code already uses fractional patents (dividing by team size), so M21 is adjusting for team size twice; team size is in the denominator of the left-hand side and is a control variable on the right-hand side.
GPT5.2 (9min thinking)
GPT does recognize the “bad control” problem, where team size is a causal channel through which cluster size affects productivity. But it doesn’t catch the double-adjustment problem.
Grade: 5/10
GPT5.4 (3min thinking)
GPT again mentions bad controls, and misses the double adjustment.
Grade: 5/10
Refine
Refine does flag the incorrect interpretation of team size as a confounder. It misses the double adjustment.
Grade: 5/10
Issue 10: unreproducible data cleaning
The cleaning code in M21 uses many-to-many merges that make the results unreproducible. Different runs of the code produce different datasets.
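The nondeterminism comes from how Stata’s merge m:m handles duplicate keys: rather than forming all key matches (as joinby does), it pairs duplicate rows by position, so the output depends on the row order left by any preceding (unstable) sort. A toy Python illustration:

```python
# Two toy tables keyed on inventor name, with duplicate keys.
left = [("smith", "patent1"), ("smith", "patent2")]
right = [("smith", "cityA"), ("smith", "cityB")]

# A joinby-style match: every left row pairs with every right row
# sharing the key. 2 x 2 = 4 pairs, regardless of row order.
joinby = [(lk, lv, rv) for lk, lv in left for rk, rv in right if lk == rk]

# merge m:m instead pairs duplicates by row position, so the matched
# pairs change whenever a prior unstable sort reorders either table.
mm_pairs = list(zip(left, right))
mm_reordered = list(zip(left, list(reversed(right))))
# Same data, different row order, different matched pairs.
```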
GPT5.2 (6min thinking)
GPT picks up the many-to-many merges. It now notices the mover/stayer issue with interpolation that it missed above. It again notices the lack of unique identifier.
Grade: 10/10
GPT5.4 (4min thinking)
GPT again notices the main problems: many-to-many merges, lack of unique identifier, and problems with the calculation of cluster size.
Grade: 10/10
Refine
Refine does not discuss the cleaning code.
Grade: 0/10
Takeaways
Here’s my rough average scoring (treating all problems as equally important):
GPT5.2: 5.8/10
GPT5.4: 5.9/10
Refine: 2.3/10
So GPT5.4 was only slightly better than 5.2. However, the GPT scores are somewhat noisy: as mentioned, GPT5.2 has been able to detect the Table 8 coding error in the past, but missed it here. And 5.2 caught the cluster quality issue, while 5.4 missed it.3
GPT5.2 was more sensitive to the prompt. As an example, I mentioned in the generic prompt to be aware of issues like an unbalanced panel. GPT5.2 really anchored on this, and brought up a part of the cleaning code that does create an unbalanced panel, but only for Table A6. It mistakenly took this to apply to all of the results. Better prompting could help.
For GPT5.4, I created a project with the PDFs and code files attached as sources, which each session had access to; for 5.2 I attached the PDFs and relevant code to each session. This led to much shorter thinking times for 5.4, perhaps related to processing PDFs. But the extra context seemed to help, and 5.4 avoided fixating on the unbalanced panel.
Refine looks at the whole paper at once; you can’t tell it to focus on a specific result. This likely explains the lower score. It’s also not designed for evaluating code.
Overall, I think this shows that current reasoning LLMs are useful as a first pass or an independent check when evaluating a paper.
1. refine doesn’t have an input prompt.
2. Here’s the prompt I used for each GPT5.2 session:
Evaluate the main results in the paper. For each result, ask: Is the method conceptually sound? Does the code correctly implement the method?
Make sure to point out anything that could be a fatal flaw; eg. if the code works assuming the panel is balanced, but fails when the panel is unbalanced, then flag this as a serious problem, because panel data is often unbalanced. Don’t assume the code works just because the panel might be balanced. Apply this lesson generally. Don’t be overly nice to the paper; we care about the truth, and are scrutinizing the paper to see if we can trust it in the real world.
Focus on major issues that could flip the results. Don’t focus on small issues like apparent typos; note that in Stata, it is legal to use shortened versions of a variable name when the context is unambiguous (e.g. inventor for inventor_id, or org for org_id).
The attached code is in .txt files. The attached README mentions .do files, but I converted these to .txt so you can read them.
Summarize your discussion with one sentence each for the soundness of the method and the correctness of the code.
3. Note that I’m testing only for false negatives here, by giving it cases with known problems and seeing if it can detect them. We could also test for false positives by giving it the correct results (e.g. Table 3) and seeing if it finds a nonexistent problem.
