Does GitHub Copilot improve code quality? Our causal analysis reveals its true impact on PR size, code coverage, and code smells.
According to The New York Times, we are two to three years away from AI systems capable of doing almost any cognitive task a human can do, a consensus shared by the artificial intelligence labs and the US government.
“I’ve talked to a number of people at firms that do high amounts of coding, and they tell me that by the end of this year or next year, they expect most code will not be written by human beings,” says opinion columnist Ezra Klein in a March 2025 podcast episode.
Given this prediction, we should see software engineering organizations adopting AI coding assistants at blazing speed.
But that is not the case.
Interestingly, many large enterprises with thousands or tens of thousands of software developers have been rolling out code assistants slowly and cautiously, which could be to their competitive detriment. Google, by contrast, says AI systems already generate over 25% of its new code.
{{cta}}
What are enterprises waiting for? More evidence.
More evidence is needed on the cause and effect of AI-augmented coding. Specifically, they are seeking:
Faros AI, an engineering hub that helps enterprises navigate their AI transformation, conducted causal analysis research to definitively determine Copilot’s impact on code quality—research that can inform the strategy for integrating AI into developer workflows safely and confidently.
While there have been some studies of the effects of Copilot using A/B testing and controlled experiments, the inherent variability in engineering teams, processes, and goals makes these studies very challenging.
In practice, most companies lack the sample size or experimental discipline required to run an experiment that accounts for the inherent complexities of engineering organizations and the biases present in the data, which leaves simple correlations incomplete.
Without a rigorous approach, the impact figures often cited can be inaccurate or downright misleading. However, by applying causal analysis, as we did at Faros AI, you can overcome the complexity to isolate the true downstream effects of Copilot on engineering workflows.
Indeed, any seasoned software engineering manager or developer will have strong beliefs, but causal analysis aims to back or challenge those beliefs with solid evidence and answer the question: "What is the measurable impact of Copilot on Pull Requests (PRs) and code quality?"
Misunderstanding the cause and effect of Copilot can lead to three types of mistakes when adopting AI:
Causal analysis, broadly speaking, aims to identify the direction and effect of one or more variables (or “treatments”) on an outcome (or “measure” or “metric of interest”).
Unlike traditional analyses that only look at correlations (e.g., “X and Y move together” or “X and Y don’t move together”), causality is concerned with cause and effect.
To illustrate:
Accounting for confounders is critical in developing reliable causal estimates of effect.
In the “smoking causes lung cancer” example, age can be a significant confounder in determining the effect of smoking on lung cancer. Older individuals have a higher baseline risk of developing various cancers, including lung cancer.
If older individuals also happen to smoke more frequently than younger individuals, a simple correlation analysis could mistakenly attribute the entire increase in lung cancer incidents to smoking alone. In reality, part of the observed “smoking effect” might be driven by age, an independent factor that increases lung cancer risk all by itself.
To properly isolate the causal impact of smoking, you have to account for age—meaning you statistically control for it or otherwise remove its influence, so you do not overestimate or underestimate smoking’s effect on lung cancer risk.
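To make that adjustment concrete, here is a minimal sketch in Python with invented numbers: a naive pooled comparison mixes the age effect into the "smoking effect," while stratifying by age and reweighting recovers a smaller, fairer estimate.

```python
# Hypothetical counts: smokers skew older, and older people have a higher
# baseline cancer rate regardless of smoking.
groups = {
    # (age, smoker): (n_people, cancer_rate)
    ("young", False): (800, 0.01),
    ("young", True):  (200, 0.02),
    ("old",   False): (200, 0.04),
    ("old",   True):  (800, 0.08),
}

def pooled_rate(smoker):
    n = sum(c for (a, s), (c, r) in groups.items() if s == smoker)
    cases = sum(c * r for (a, s), (c, r) in groups.items() if s == smoker)
    return cases / n

# Naive comparison ignores age and folds the age effect into the "smoking effect".
naive = pooled_rate(True) - pooled_rate(False)

# Age-adjusted comparison: compare within each age group, then average the
# differences using the overall age distribution as weights.
total = sum(c for c, _ in groups.values())
adjusted = 0.0
for age in ("young", "old"):
    age_n = sum(c for (a, s), (c, r) in groups.items() if a == age)
    diff = groups[(age, True)][1] - groups[(age, False)][1]
    adjusted += (age_n / total) * diff

print(f"Naive risk difference:        {naive:.3f}")    # ~0.052
print(f"Age-adjusted risk difference: {adjusted:.3f}")  # ~0.025
```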
In standard correlation-based analytics, you might see a strong relationship between two variables, but you don’t necessarily know which (if either) drives the other. There might be a third variable lurking in the background that influences both.
Sussing out the nature and direction of the relationship between all the potential variables is both the power and the challenge in causal analysis.
To define relationships between variables for causal analysis, you start by examining the variables and data. A necessary first step is making the data observable. If you cannot see and understand your data, you cannot use it.
Faros AI is a powerful tool for this task because it centralizes data from across the software delivery life cycle, standardizes it into a unified data model, and translates raw data into useful measurements with powerful visualizations.
Observing the data alone isn’t enough, however. Drawing conclusions based on observation alone can lead to misinterpretation caused by:
A classic example of spurious correlations is noticing that “the global average temperature seemingly correlates with the prevalence of pirates.” Obviously, neither variable causes the other, and it’s a meaningless coincidence.
Consider observing that "ice cream sales correlate with drowning incidents." This doesn't mean that purchasing ice cream directly causes drownings. Instead, both variables are influenced by a third factor: warm weather. During warmer months, people are more likely to buy ice cream and swim, which can lead to increased drowning incidents.
Similarly, correlations in data may be linked by external factors without a direct causal relationship between them.
Simpson’s Effect (or Simpson’s Paradox) is frequently discussed in data science interviews. It describes a situation where a trend or effect seen in several groups of data reverses when these groups are combined.
Consider the example of baseball players’ batting averages: A player might have a higher batting average than another player in two separate seasons; however, when you combine statistics across both seasons, the second player might end up with a higher overall batting average. This paradox occurs because the number of at-bats each player had in each season weights the combined average differently.
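The reversal is easy to reproduce with a few invented numbers (the hits and at-bats below are purely illustrative):

```python
# Hypothetical hits / at-bats. Player A beats Player B in each season,
# but Player B wins overall because B's at-bats are concentrated in the
# stronger season.
stats = {
    "Player A": {"season 1": (80, 300), "season 2": (20, 60)},
    "Player B": {"season 1": (10, 40),  "season 2": (90, 280)},
}

for player, seasons in stats.items():
    for season, (hits, at_bats) in seasons.items():
        print(f"{player} {season}: {hits / at_bats:.3f}")
    total_hits = sum(h for h, _ in seasons.values())
    total_at_bats = sum(ab for _, ab in seasons.values())
    print(f"{player} combined: {total_hits / total_at_bats:.3f}")

# Season 1: A .267 > B .250   Season 2: A .333 > B .321
# Combined: A .278 < B .312 -- the per-season trend reverses.
```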
True—humans can sometimes rule out spurious, non-causal, and nonsensical explanations using common sense. But it’s also easy to be misled, like in the baseball example.
The classic way of avoiding these issues is to conduct an A/B test, where you only change the variable you care about and then directly measure what happens.
But how do you run an A/B test on an engineering organization where there is so much going on at any given moment that could be impacting performance?
In real-world engineering organizations, any analysis is complicated by the fact that engineers are not all the same. They differ by:
Engineer-specific factors:
Team-specific factors:
Project-specific factors:
Organizational factors:
Repository and codebase factors:
Process and workflow factors:
All of these differences can wreak havoc on attempts to compare one group of engineers (or teams) to another, especially if you ignore these differences when trying to measure a cause-and-effect relationship like “Does Copilot improve code quality?”
If you give Copilot to one team and not another and do a naïve comparison, you’re liable to see the effect of countless confounders. Perhaps Team A is more senior, or perhaps Team B is dealing with more urgent incidents. Without careful methods, you risk attributing differences to Copilot when something else is at play.
Given the complexity of engineering organizations—and the prevalence of confounders like seniority, repository specifics, or team composition—it becomes very important to use methods designed to tease out the real cause-and-effect relationship.
Specifically, we wanted to see:
Our data scientists are domain experts who are deeply knowledgeable about what engineering organizations measure and the factors likely to affect those measurements.
We chose to focus our causal analysis on quality metrics, not on engineering throughput or velocity. Why? In all the surveys our clients conduct and analyze with Faros AI, developers are bullish on Copilot, reporting significant time savings and high satisfaction.
{{cta}}
However, the jury is still out on Copilot's impact on the quality of engineering work, which could be a blind spot for many organizations. If quality is left unchecked, there are many risks related to long-term maintainability, readability, and security.
When the data is messy (we do not have a perfect A/B test or do not have perfectly matched teams), standard correlation measures or simple machine learning models may produce biased or incorrect estimates.
Instead, we used a particular technique within causal analysis that avoids the need to map out every link (like seniority → code coverage → time-to-approve) but still helps us isolate the effect of Copilot usage. The technique is known as double (or “debiased”) machine learning.
Note: The sections below describe double machine learning and its application in gory detail.
If you don't care about the details and just want to know what we found, skip ahead to our results.
Double (debiased) machine learning is a relatively new method in the fields of data science and causal inference (originally published around 2016). Its core concept is to combine machine learning models with a cross-validated approach to control for observable confounders. By “observable confounders,” we mean any measurable variables that might impact both the “treatment” (here, Copilot usage) and the “outcome” (like PR approval time).
Let’s define a few terms in this specific scenario:
Despite the name, double machine learning actually uses three layers of ML models to obtain a final estimate of the true causal effect (the “double” part refers to the partialing-out of treatment and outcome residuals, the differences between predicted and actual values, typically accomplished with two main models). Here is the general structure:
1. An outcome model that predicts the outcome (e.g., PR approval time) from the observable confounders alone.
2. A treatment model that predicts the treatment (Copilot usage) from those same confounders.
3. A final model that relates the outcome residuals to the treatment residuals; whatever relationship remains is the estimate of Copilot’s causal effect.
Said differently: we first strip out everything the confounders can explain about both Copilot usage and the outcome, and then measure how the leftover variation in usage relates to the leftover variation in the outcome.
This procedure allows for separating out causal effects without needing to define the functional format for every relationship. This is important because, while it is clear that seniority influences review time, the relationship is unlikely to be linear or consistent across all companies. Defining the precise mathematical correlation between seniority and review time is complex, and accounting for every potentially significant variable is even more challenging. The ability to analyze these relationships without requiring exact predefined formulas is a significant advantage.
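To make the structure concrete, here is a minimal sketch of the partialing-out procedure on synthetic data. The random forests, the synthetic confounders, and the linear final stage are illustrative choices, not a description of our production pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 5_000

# Confounders: e.g., seniority, team size, repo age (all synthetic here).
X = rng.normal(size=(n, 3))

# Treatment (Copilot usage) depends on the confounders plus noise...
T = 2 * X[:, 0] + X[:, 1] ** 2 + rng.normal(size=n)
# ...and so does the outcome, plus a "true" treatment effect of 0.5.
true_effect = 0.5
Y = true_effect * T + 3 * X[:, 0] + np.sin(X[:, 2]) + rng.normal(size=n)

# Stage 1: predict the outcome from confounders only (out-of-fold predictions).
Y_hat = cross_val_predict(RandomForestRegressor(n_estimators=200), X, Y, cv=5)
# Stage 2: predict the treatment from confounders only (out-of-fold predictions).
T_hat = cross_val_predict(RandomForestRegressor(n_estimators=200), X, T, cv=5)

# Stage 3: regress the outcome residuals on the treatment residuals.
# The slope is the debiased estimate of the causal effect.
final = LinearRegression().fit((T - T_hat).reshape(-1, 1), Y - Y_hat)
print(f"Estimated effect: {final.coef_[0]:.3f} (true effect: {true_effect})")
```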
For the technique to work well, there are still some very important assumptions that must be met:
Below, we break down how our analysis met these conditions.
Challenge: The GitHub Copilot API does not provide fine-grained data on exactly when, within a PR’s code changes, Copilot was used.
Instead, we developed an approximate measure: the number of times Copilot was accessed in the time window around the moment the PR was marked “Ready for review” (7 days before and 3 days after). We chose this window by examining the median lead time for tasks across customers and selecting a window that covered most tasks’ coding time.
Challenge: The GitHub Copilot API provides information on when an engineer accessed Copilot; however, it does not provide fine-grained information about the code generated using Copilot.
For the purposes of this study, we assumed that the times Copilot was accessed correlated with how often the engineer was using it and made the decision to treat “the number of Copilot accesses” as our “treatment variable.” While this is an approximation, it is sufficient to capture how heavily an engineer was relying on Copilot for that PR.
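As a rough sketch of how such a treatment variable can be derived, the snippet below counts Copilot access events in the window around each PR's "Ready for review" timestamp. The dataframes and column names are hypothetical stand-ins, not the actual Faros AI schema:

```python
import pandas as pd

# Hypothetical inputs; the real pipeline reads these from the Faros AI data model.
prs = pd.DataFrame({
    "pr_id": [101, 102],
    "author": ["alice", "bob"],
    "ready_for_review_at": pd.to_datetime(["2025-03-10", "2025-03-12"]),
})
copilot_events = pd.DataFrame({
    "author": ["alice", "alice", "bob"],
    "accessed_at": pd.to_datetime(["2025-03-05", "2025-03-11", "2025-02-20"]),
})

def copilot_accesses(pr, window_before="7D", window_after="3D"):
    """Count Copilot accesses by the PR author in the window around ready-for-review."""
    start = pr.ready_for_review_at - pd.Timedelta(window_before)
    end = pr.ready_for_review_at + pd.Timedelta(window_after)
    events = copilot_events[copilot_events.author == pr.author]
    return events.accessed_at.between(start, end).sum()

# Treatment variable: number of Copilot accesses attributed to each PR.
prs["copilot_accesses"] = prs.apply(copilot_accesses, axis=1)
print(prs)  # alice: 2 accesses in the window, bob: 0
```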
In the future, we’ll enhance our calculations with detailed measurements extracted from the IDE itself, provided by Faros AI IDE extensions like this one, including how much code was created directly using Copilot as well as the pull requests, languages, and repos where the code was used.
As explained above, the causal analysis focused on Copilot’s impact on code quality because uncertainty about the downstream impacts is one reason organizations are rolling out the tool slowly.
In particular, we examined the effects of Copilot usage on:
We chose these specific metrics mainly because PR data is typically very complete, high-quality, and highly indicative of how engineering organizations operate. If there’s no negative outcome in code quality metrics, that is highly reassuring for widespread usage.
To maintain a robust and ongoing double machine learning causal analysis, it is essential to continuously capture all relevant inputs that might affect the outcome or treatment. Excluding critical confounders (or including variables that are effects of your treatment) can lead to overestimating or underestimating the real effect of Copilot.
Our automated machine learning workflow leverages a library called Featuretools to periodically generate features (variables) for data within the Faros AI standard schema. As Faros AI continually ingests data from a range of engineering tools and normalizes it, we've established a general approach to Copilot analysis that 1) applies universally across all our customers without custom feature engineering and 2) provides a comprehensive set of features for our analysis.
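For readers unfamiliar with Featuretools, here is a minimal sketch of deep feature synthesis generating PR-level features from normalized tables. The dataframes and column names are simplified stand-ins for illustration, not the Faros AI schema:

```python
import featuretools as ft
import pandas as pd

# Simplified stand-ins for normalized engineering tables.
prs = pd.DataFrame({
    "pr_id": [1, 2],
    "author_id": [10, 11],
    "ready_for_review_at": pd.to_datetime(["2025-03-10", "2025-03-12"]),
})
commits = pd.DataFrame({
    "commit_id": [100, 101, 102],
    "pr_id": [1, 1, 2],
    "lines_changed": [40, 12, 300],
    "committed_at": pd.to_datetime(["2025-03-08", "2025-03-09", "2025-03-11"]),
})

es = ft.EntitySet(id="engineering")
es = es.add_dataframe(dataframe_name="prs", dataframe=prs, index="pr_id",
                      time_index="ready_for_review_at")
es = es.add_dataframe(dataframe_name="commits", dataframe=commits, index="commit_id",
                      time_index="committed_at")
es = es.add_relationship("prs", "pr_id", "commits", "pr_id")

# Deep feature synthesis: automatically derives aggregations such as
# SUM(commits.lines_changed) or COUNT(commits) for each PR.
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="prs",
    agg_primitives=["sum", "count", "mean"],
)
print(feature_matrix.head())
```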
Recognizing that not all Faros AI customers immediately integrate the complete range of engineering data sources (such as calendar information or deployment data) alongside common sources like version control (e.g., GitHub), ticket tracking (e.g., Jira), or incidents (e.g., PagerDuty), our feature definitions are made robust against missing information. This ensures that analyses remain insightful even when some confounding variables are absent. However, it may occasionally lead to overestimations of the effects of Copilot (these estimations will still be more accurate than looking at the completely uncorrected data).
The features provide an ongoing comprehensive view of activities surrounding the creation and review of each PR, accounting for everything from authors and reviewers to repository and team information.
These features were meticulously curated over repeated analyses to ensure none are downstream effects of Copilot usage itself. When you include downstream effects (often called “bad controls”), you can distort the results and underestimate Copilot’s true effect.
We ran several sensitivity analyses across all customers, examining feature importance and effect size to remove any features that, upon reflection, were likely downstream effects.
For example, we removed incident assignments to pull request authors in the post "Ready for review" period from the feature set because they tended to be suspiciously predictive of code coverage, likely indicating that if you don't test your code properly and it causes an incident, you will be the one assigned to fix it.
With a validated workflow for feature creation in place, our model selection process rigorously identifies the best architecture for each model (outcome, treatment, and final residual correlation) across our customers' data.
Using scikit-optimize for Bayesian optimization, we periodically recalibrate hyperparameters and select among a variety of scikit-learn tree-based models (e.g., RandomForest, HistGradientBoosting, and ExtraTrees) every few months. This ensures that we are optimizing model selection for each customer's evolving dataset within the double machine learning empirical assessments.
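A simplified sketch of that selection step might look like the following; the candidate models, search spaces, and synthetic data are illustrative assumptions, not our actual configuration:

```python
from skopt import BayesSearchCV
from skopt.space import Integer
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.datasets import make_regression

# Synthetic stand-in data; the real pipeline runs per customer on their own features.
X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)

candidates = {
    "random_forest": (RandomForestRegressor(random_state=0), {
        "n_estimators": Integer(100, 500),
        "max_depth": Integer(3, 20),
    }),
    "extra_trees": (ExtraTreesRegressor(random_state=0), {
        "n_estimators": Integer(100, 500),
        "min_samples_leaf": Integer(1, 20),
    }),
}

best = {}
for name, (model, space) in candidates.items():
    # Bayesian search over the hyperparameter space for each candidate architecture.
    search = BayesSearchCV(model, space, n_iter=25, cv=3, random_state=0)
    search.fit(X, y)
    best[name] = (search.best_score_, search.best_params_)
    print(name, search.best_score_)

# The winning architecture and hyperparameters feed the DML nuisance models.
```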
All of the models we evaluate in this model selection step are tree-based. Tree-based models are particularly well suited to double machine learning applications because they naturally capture nonlinear relationships and interactions among variables (such as “Engineer with 3 years of experience + Java usage + a busy schedule may behave differently than a brand-new engineer working in Python,” etc.). This allows the models to capture the complexity of the interactions among confounding variables without needing to explicitly define the relationships between them.
After models and hyperparameters are established, the EconML non-parametric double machine learning model is applied to each organization's data. Predictions, feature importance, and final outcomes undergo rigorous verification, including repeated sensitivity analyses and model quality ratings. Models that do not meet established quality standards for a customer (determined by the R-score metric) are systematically excluded from causal effect calculations.
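For illustration, here is a minimal sketch of fitting EconML's NonParamDML estimator on synthetic data; the nuisance models, the five synthetic confounders, and the 0.3 "true" effect are assumptions made up for this example:

```python
import numpy as np
from econml.dml import NonParamDML
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 3_000
X = rng.normal(size=(n, 5))                  # confounding features for each PR
T = X[:, 0] + rng.normal(size=n)             # Copilot usage (treatment)
Y = 0.3 * T + X[:, 0] + rng.normal(size=n)   # e.g., PR review time (outcome)

est = NonParamDML(
    model_y=RandomForestRegressor(n_estimators=200),      # outcome nuisance model
    model_t=RandomForestRegressor(n_estimators=200),      # treatment nuisance model
    model_final=RandomForestRegressor(n_estimators=200),  # final effect model
    cv=5,
)
est.fit(Y, T, X=X)

# Estimated effect of one additional unit of Copilot usage per PR
# (the synthetic "true" effect here is 0.3).
effects = est.effect(X)
print("Average estimated effect:", effects.mean())
```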
At the conclusion of this analysis, after examining the results across ten companies using Faros AI to navigate GitHub Copilot adoption and optimization, we found that, overall, there are no significant negative effects of Copilot usage on code quality at the organizational level.
In other words:
These important findings indicate that, overall, engineers can continue using Copilot without worrying about a decline in key code quality indicators. Engineers themselves widely report that Copilot helps them move faster, and from a broader organizational perspective, this doesn’t appear to happen at the expense of code quality.
That’s it? Are we done? No. Because that result was overall.
Some teams saw a positive impact on quality, and others saw a negative impact.
While overall there was no significant negative impact on PR quality metrics (a very encouraging finding), individual teams within organizations did show varying levels of effects of Copilot usage on these metrics.
For example, some teams' PR Code Coverage decreased, while for others, PR Diff Sizes increased.
That is where a team-by-team analysis is paramount. We all know that one bad outcome in a highly sensitive area of the codebase can reverberate across the entire organization, undermining confidence in AI even when that loss of confidence isn’t fully justified. Freezes, moratoriums, and limitations on Copilot usage may follow in an effort to prevent similar incidents.
{{cta}}
What can you do?
As soon as adoption begins, every engineering manager should review the causal analysis for their specific team and improve key processes around code review and code quality to mitigate any negative impact.
It is always prudent to consider the team’s context and adapt accordingly:
Overall, it is reassuring to discover that engineers can use Copilot without creating a nightmare for the next person to maintain the code. In large engineering organizations, this is a major concern when adopting new tools: “Will it scale? Will it degrade code quality?”
This causal analysis research, powered by double (debiased) machine learning techniques, strongly suggests that, at this point in time, Copilot is overall safe from a code-quality perspective.
If your organization has flexible budgets, investing in Copilot licenses can accelerate development speed while maintaining quality metrics like code coverage and PR review time. Combined with ongoing qualitative feedback from engineers, this is a promising result for broad adoption.
However, it's important to recognize that with evolving technology and practices, the impact on code quality could change. Thus, using a tool like Faros AI to monitor code quality remains essential. Currently, Faros AI determines that using coding assistants does not detrimentally impact code quality, but this could shift with wider adoption, increased reliance on technology, or technological evolution.
Furthermore, any teams with evident negative outcomes could greatly benefit from investigating how Copilot is being used. Are individuals ignoring lint or code smell warnings? Are they relying heavily on generic snippets that clash with architectural constraints? Analyzing these outliers can help refine best practices, allowing teams to implement necessary process changes and share successful strategies to mitigate issues.
For a conversation with a Copilot adoption expert, request a meeting with Faros AI.