Does Copilot Improve Code Quality? The Cause and Effect Data Is In

Does GitHub Copilot improve code quality? Our causal analysis reveals its true impact on PR size, code coverage, and code smells.

Leah McGuire
20 min read
March 13, 2025

AI-assisted coding is here. And yet, adoption lags.

According to The New York Times, we are two to three years away from AI systems capable of doing almost any cognitive task a human can do, a consensus shared by the artificial intelligence labs and the US government.

“I’ve talked to a number of people at firms that do high amounts of coding, and they tell me that by the end of this year or next year, they expect most code will not be written by human beings,” says opinion columnist Ezra Klein in a March 2025 podcast episode.

Given this prediction, we should see software engineering organizations adopting AI coding assistants at blazing speed. 

But that is not the case. 

Interestingly, many large enterprises with thousands or tens of thousands of software developers have been rolling out code assistants slowly and cautiously, which could be to their competitive detriment. Google, for one, says AI systems already generate over 25% of its new code.


What are enterprises waiting for? More evidence. 

Enterprises need more evidence on the cause and effect of AI-augmented coding. Specifically, they are seeking:

  • ROI proof that the improved velocity and productivity justify the extra license costs.
  • Proof that tools like GitHub Copilot improve code quality (or don’t make it worse).

Faros AI, an engineering hub that helps enterprises navigate their AI transformation, conducted causal analysis research to definitively determine Copilot’s impact on code quality—research that can inform the strategy for integrating AI into developer workflows safely and confidently.

Where correlation and A/B testing can be misleading

While there have been some studies of the effects of Copilot using A/B testing and controlled experiments, the inherent variability in engineering teams, processes, and goals makes these studies very challenging. 

In practice, most companies lack the sample size or experimental discipline needed to account for the inherent complexities and biases of engineering organizations, so simple correlations tell an incomplete story.

Without a rigorous approach, the impact figures often cited can be inaccurate or downright misleading. However, by applying causal analysis, as we did at Faros AI, you can overcome the complexity to isolate the true downstream effects of Copilot on engineering workflows. 

Indeed, any seasoned software engineering manager or developer will have strong beliefs, but causal analysis aims to back or challenge those beliefs with solid evidence and answer the question: "What is the measurable impact of Copilot on Pull Requests (PRs) and code quality?" 

Causal analysis prevents three mistakes

Misunderstanding the cause and effect of Copilot can lead to three types of mistakes when adopting AI:   

  1. Moving too slow.
    Example:
    You run a small pilot but cannot prove Copilot is having a positive effect on productivity or quality.
    Mistake: You don’t distribute more licenses for months (or years?).
    Consequence: Most of your organization misses out on the benefits of AI-augmented coding, and you’re outpaced by competition. 
  2. Moving too fast.
    Example:
    You provide Copilot licenses to everyone.
    Mistake: You don’t address the changes Copilot is causing downstream in reviews, testing, or deployments.
    Consequence: Your coding velocity gains are erased, and the business sees no positive outcomes. You decide to divest from the tool.
  3. Moving in the wrong direction.
    Example:
    You misattribute velocity improvements to Copilot when they are actually caused by something else.
    Mistake: You make incorrect assumptions in planning about your staffing, skills, and capacity.
    Consequence: You don’t allocate enough resources and miss critical business commitments.
Causal analysis helps prevent three mistakes in AI adoption

What is causal analysis?

Causal analysis, broadly speaking, aims to identify the direction and effect of one or more variables (or “treatments”) on an outcome (or “measure” or “metric of interest”). 

Unlike traditional analyses that only look at correlations (e.g., “X and Y move together” or “X and Y don’t move together”), causality is concerned with cause and effect. 

To illustrate:

  • Causality is about establishing “smoking causes lung cancer,” not the other way around.
  • Causality wants to ensure that no hidden external factors (confounders) contribute to the measured effect of the treatment or variable of interest.

Accounting for confounders is critical in developing reliable causal estimates of effect. 

In the “smoking causes lung cancer” example, age can be a significant confounder in determining the effect of smoking on lung cancer. Older individuals have a higher baseline risk of developing various cancers, including lung cancer. 

If older individuals also happen to smoke more frequently than younger individuals, a simple correlation analysis could mistakenly attribute the entire increase in lung cancer incidents to smoking alone. In reality, part of the observed “smoking effect” might be driven by age, an independent factor that increases lung cancer risk all by itself.

Isolating the causal impact of smoking must account for age

To properly isolate the causal impact of smoking, you have to account for age—meaning you statistically control for it or otherwise remove its influence, so you do not overestimate or underestimate smoking’s effect on lung cancer risk.  
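To make that adjustment concrete, here is a toy simulation (all numbers invented) showing how controlling for age changes the estimated effect of smoking; it is an illustration of the idea, not part of our Copilot analysis:

```python
# Toy simulation of confounding (all numbers invented): age drives both smoking
# frequency and cancer risk, so a naive estimate overstates smoking's effect.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 10_000
age = rng.uniform(20, 80, n)
smoking = 0.05 * age + rng.normal(0, 1, n)                 # older people smoke more here
risk = 0.02 * age + 0.30 * smoking + rng.normal(0, 1, n)   # true smoking effect: 0.30

naive = LinearRegression().fit(smoking.reshape(-1, 1), risk)
adjusted = LinearRegression().fit(np.column_stack([smoking, age]), risk)
print(f"naive estimate:    {naive.coef_[0]:.2f}")    # inflated by the age confounder
print(f"adjusted estimate: {adjusted.coef_[0]:.2f}")  # close to the true 0.30
```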

In standard correlation-based analytics, you might see a strong relationship between two variables, but you don’t necessarily know which (if either) drives the other. There might be a third variable lurking in the background that influences both. 

Sussing out the nature and direction of the relationship between all the potential variables is both the power and the challenge in causal analysis. 

Distinguishing correlation from causation

To define relationships between variables for causal analysis, you start by examining the variables and data. A necessary first step is making the data observable. If you cannot see and understand your data, you cannot use it.

Faros AI is a powerful tool for this task because it centralizes data from across the software delivery life cycle, standardizes it into a unified data model, and translates raw data into useful measurements with powerful visualizations. 

Observing the data alone isn’t enough, however. Drawing conclusions based on observation alone can lead to misinterpretation caused by: 

  • Spurious correlations
  • Non-causal relationships
  • Simpson’s Effect

Pitfall 1: Spurious correlations

A classic example of spurious correlations is noticing that “the global average temperature seemingly correlates with the prevalence of pirates.” Obviously, neither variable causes the other, and it’s a meaningless coincidence.

Example of a spurious correlation between piracy and global warming

Pitfall 2: Non-causal relationships

Consider observing that "ice cream sales correlate with drowning incidents." This doesn't mean that purchasing ice cream directly causes drownings. Instead, both variables are influenced by a third factor: warm weather. During warmer months, people are more likely to buy ice cream and swim, which can lead to increased drowning incidents.

Similarly, correlations in data may be linked by external factors without a direct causal relationship between them.

Pitfall 3: Simpson’s Effect

Simpson’s Effect (or Simpson’s Paradox) is frequently discussed in data science interviews. It describes a situation where a trend or effect seen in several groups of data reverses when these groups are combined. 

Consider the example of baseball players’ batting averages: A player might have a higher batting average than another player in two separate seasons; however, when you combine statistics across both seasons, the second player might end up with a higher overall batting average. The paradox arises because each player’s at-bats are distributed differently across the seasons, so the combined average weights the seasons differently for each player.

Example of Simpson's Effect
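Here is a small worked example of the reversal (the numbers are illustrative):

```python
# Worked numbers for Simpson's paradox: Player B has the higher batting average
# in each season, yet Player A has the higher average when seasons are combined.
seasons = {
    "Season 1": {"A": (12, 48),   "B": (104, 411)},   # (hits, at-bats)
    "Season 2": {"A": (183, 582), "B": (45, 140)},
}

for year, stats in seasons.items():
    for player, (hits, at_bats) in stats.items():
        print(f"{year} {player}: {hits / at_bats:.3f}")   # B wins both seasons

for player in ("A", "B"):
    hits = sum(seasons[year][player][0] for year in seasons)
    at_bats = sum(seasons[year][player][1] for year in seasons)
    print(f"Combined {player}: {hits / at_bats:.3f}")     # A: .310, B: .270
```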

Determining causality in a dizzying engineering ecosystem

True—humans can sometimes rule out spurious, non-causal, and nonsensical explanations using common sense. But it’s also easy to be misled, like in the baseball example. 

The classic way of avoiding these issues is to conduct an A/B test, where you only change the variable you care about and then directly measure what happens. 

But how do you run an A/B test on an engineering organization where there is so much going on at any given moment that could be impacting performance?

Many simultaneous factors may impact engineering performance during Copilot A/B tests

In real-world engineering organizations, any analysis is complicated by the fact that engineers are not all the same. They differ by:

Engineer-specific factors:

  • Seniority and experience level: Junior, mid-level, or senior, which could affect coding proficiency and tool utilization.
  • Tenure with the company: Experience specific to the company, which might impact familiarity with internal systems or codebases.
  • Technical skillset: Specific language proficiency and technical competencies, which can influence the ease of tool adoption.
  • Past performance metrics: Historical productivity and quality metrics.
  • Preference for tools: Personal affinity or bias towards using certain tools, including AI-assisted ones like Copilot.
  • Human factors: Individual variance in adaptability and learning new technologies. Current workloads, motivation, and stress levels.

Team-specific factors:

  • Team size: Smaller versus larger teams might have different dynamics and integration strategies.
  • Team composition: Diversity in roles and responsibilities, including the distribution of expertise.
  • Team communication and collaboration: Established protocols and efficiency in communication.
  • Cohesiveness and culture: How well team members work together and the team's openness to technology adoption.

Project-specific factors:

  • Project complexity: Varying levels of complexity can affect productivity metrics and tool effectiveness.
  • Project timeline: The urgency and deadlines can pressure tool usage and productivity outcomes.
  • Customer demands: Pressure from clients or stakeholders affecting team dynamics.

Organizational factors:

  • Resources and infrastructure: Availability of hardware, software, and network resources.
  • Company culture and policies: Attitudes towards innovation and tool adoption policies.
  • Support and training: Access to training programs and support for new tools.
  • Change management practices: How the organization manages transitions to new tools.

Repository and codebase factors:

  • Repository size and structure: Complexity and organization of the code repositories.
  • Historical code quality: Prior metrics on code quality within the repository.
  • Version control practices: How versioning and code changes are managed.

Process and workflow factors:

  • Development practices: Use of agile, waterfall, DevOps, or other methodologies.
  • Code review processes: Protocols for reviewing and approving code changes.
  • Testing and CI/CD pipelines: How testing and continuous integration are managed.
  • Industry trends: How industry-wide shifts influence tool adoption.

All of these differences can wreak havoc on attempts to compare one group of engineers (or teams) to another, especially if you ignore these differences when trying to measure a cause-and-effect relationship like “Does Copilot improve code quality?”

If you give Copilot to one team and not another and do a naïve comparison, you’re liable to see the effect of countless confounders. Perhaps Team A is more senior, or perhaps Team B is dealing with more urgent incidents. Without careful methods, you risk attributing differences to Copilot when something else is at play.

Causal analysis for Copilot’s impact on code quality

Given the complexity of engineering organizations—and the prevalence of confounders like seniority, repository specifics, or team composition—it becomes very important to use methods designed to tease out the real cause-and-effect relationship. 

Specifically, we wanted to see: 

  • Does GitHub Copilot improve code quality?
  • Does GitHub Copilot improve code coverage? 
  • Does GitHub Copilot reduce code smells? 
  • Does GitHub Copilot speed up PR reviews? 
  • … Or does it have no effect at all?

Our data scientists are domain experts who are deeply knowledgeable about what engineering organizations measure and the factors likely to affect those measurements. 

We chose to focus our causal analysis on quality metrics, not on engineering throughput or velocity. Why? In all the surveys our clients conduct and analyze with Faros AI, developers are bullish on Copilot, reporting significant time savings and high satisfaction. 


However, the jury is still out on Copilot's impact on the quality of engineering work, which could be a blind spot for many organizations. If quality is left unchecked, there are many risks related to long-term maintainability, readability, and security. 

When the data is messy (we do not have a perfect A/B test or do not have perfectly matched teams), standard correlation measures or simple machine learning models may produce biased or incorrect estimates. 

Instead, we used a particular technique within causal analysis that avoids the need to map out every link (like seniority → code coverage → time-to-approve) but still helps us isolate the effect of Copilot usage. The technique is known as double (or “debiased”) machine learning.

Note: The sections below describe double machine learning and its application in gory detail. 

If you don't care about the details and just want to know what we found, skip ahead to our results.

Applying double (debiased) machine learning to Copilot impact

Double (debiased) machine learning is a relatively new method in the fields of data science and causal inference (originally published around 2016). Its core concept is to combine machine learning models with a cross-validated approach to control for observable confounders. By “observable confounders,” we mean any measurable variables that might impact both the “treatment” (here, Copilot usage) and the “outcome” (like PR approval time).

Let’s define a few terms in this specific scenario:

  • Confounding Variables (Confounders): An engineer’s seniority, tenure, assigned repositories, programming languages, their team’s workload and composition specifics, the incident load and meeting load on author and reviewer, etc.
  • Treatment: The number of times Copilot was used during the creation of a PR.
  • Outcome (or Metric of Interest): PR Review Time, Code Coverage, PR Diff Size, Code Smells, etc.

Despite the name, double machine learning actually uses three layers of ML models to arrive at a final estimate of the causal effect. (The “double” refers to the two main models that partial out the confounders from the treatment and the outcome; their residuals are the differences between predicted and actual values.) Here is the general structure:

  1. Model One: Predict the outcome (i.e., PR review time, code coverage, etc.) from all the confounders.
  2. Model Two: Predict the treatment (the Copilot usage frequency) from all the confounders.
  3. Model Three: Take the residuals (the differences between the actual observed values and the model predictions) from Model One and Model Two. Whatever variation in Copilot usage the confounders cannot explain, and whatever variation in the outcome they cannot explain, remains in these residuals; Model Three models how the two sets of residuals relate.
Double machine learning in the partially linear model (Source: Fuhr, Berens, and Papies)

Said differently:

  • If an engineer’s predicted PR Review Time (based on their seniority, team info, etc.) is longer than their actual PR Review Time, that difference is a residual from Model One.
  • If an engineer’s predicted Copilot usage (based on the same confounders) differs from how much they actually used Copilot, that difference is a residual from Model Two.
  • Model Three sees how the leftover “Copilot usage difference” explains the leftover “Review Time difference,” net of all the other variables. This is repeated across many cross-validation folds to reduce overfitting and produce an unbiased, causal estimate.

This procedure allows for separating out causal effects without needing to define the functional format for every relationship. This is important because, while it is clear that seniority influences review time, the relationship is unlikely to be linear or consistent across all companies. Defining the precise mathematical correlation between seniority and review time is complex, and accounting for every potentially significant variable is even more challenging. The ability to analyze these relationships without requiring exact predefined formulas is a significant advantage.
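To make the three-model structure concrete, here is a minimal sketch in Python using scikit-learn on synthetic data. The confounders, treatment, and effect size are invented for illustration; this is not the Faros AI production pipeline, which uses the EconML implementation described later.

```python
# Minimal sketch of the double ML "partialing-out" idea on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 5_000

# Confounders: e.g., seniority (years), team size, recent incident load.
X = rng.normal(size=(n, 3))

# Treatment: Copilot accesses around "Ready for review" (driven by confounders + noise).
t = 2.0 * X[:, 0] - X[:, 1] + rng.normal(size=n)

# Outcome: PR review time (driven by confounders and, with true effect 0.5, by treatment).
y = 3.0 * X[:, 0] + X[:, 2] + 0.5 * t + rng.normal(size=n)

# Models One and Two: predict outcome and treatment from confounders, out-of-fold.
y_hat = cross_val_predict(RandomForestRegressor(n_estimators=200), X, y, cv=5)
t_hat = cross_val_predict(RandomForestRegressor(n_estimators=200), X, t, cv=5)

# Model Three: regress outcome residuals on treatment residuals.
theta = LinearRegression().fit((t - t_hat).reshape(-1, 1), y - y_hat)
print(f"Estimated causal effect: {theta.coef_[0]:.2f}")  # close to the true 0.5
```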

For the technique to work well, there are still some very important assumptions that must be met: 

  1. You have a good measure of your treatment.
  2. Your data on outcomes of interest is good.
  3. You are measuring and inputting all the potential confounding variables.
  4. You have not included any bad controls (inputs that are themselves results of your treatment) in your confounding variables.

Below, we break down how our analysis met these conditions.

Measuring treatment: How we measured Copilot usage

Challenge: The GitHub Copilot API does not provide fine-grained data on exactly when, within a PR’s code changes, Copilot was used. 

Instead, we developed an approximate measure: the number of times Copilot was accessed in the time window around the moment the PR was marked “Ready for review” (7 days before and 3 days after). We chose this window by examining the median lead time for tasks across customers and selecting a window that covered most tasks’ coding time. 

Challenge: The GitHub Copilot API provides information on when an engineer accessed Copilot; however, it does not provide fine-grained information about the code generated using Copilot.

For the purposes of this study, we assumed that the times Copilot was accessed correlated with how often the engineer was using it and made the decision to treat “the number of Copilot accesses” as our “treatment variable.” While this is an approximation, it is sufficient to capture how heavily an engineer was relying on Copilot for that PR. 
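As a rough sketch, computing this treatment variable might look like the following; the table and column names are hypothetical, not the Faros AI schema:

```python
# Hypothetical sketch of the treatment variable: count an author's Copilot access
# events from 7 days before to 3 days after the PR's "Ready for review" timestamp.
import pandas as pd

prs = pd.DataFrame({
    "pr_id": [1, 2],
    "author_id": ["alice", "bob"],
    "ready_for_review_at": pd.to_datetime(["2025-01-10", "2025-01-12"]),
})
copilot_accesses = pd.DataFrame({
    "author_id": ["alice", "alice", "bob", "bob"],
    "accessed_at": pd.to_datetime(["2025-01-05", "2025-01-11", "2024-12-01", "2025-01-13"]),
})

merged = prs.merge(copilot_accesses, on="author_id")
in_window = (
    (merged["accessed_at"] >= merged["ready_for_review_at"] - pd.Timedelta(days=7))
    & (merged["accessed_at"] <= merged["ready_for_review_at"] + pd.Timedelta(days=3))
)
treatment = (
    merged[in_window]
    .groupby("pr_id").size()
    .reindex(prs["pr_id"], fill_value=0)
    .rename("copilot_accesses")
)
print(treatment)   # PR 1 -> 2 accesses, PR 2 -> 1 access
```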

In the future, we’ll enhance our calculations with detailed measurements extracted from the IDE itself, provided by Faros AI IDE extensions like this one, including how much code was created directly using Copilot as well as the pull requests, languages, and repos where the code was used.

Outcomes of interest: PR review time, code coverage, diff size, and code smells

As explained above, the causal analysis focused on Copilot’s impact on code quality because uncertainty about the downstream impacts is one reason organizations are rolling out the tool slowly.  

In particular, we examined the effects of Copilot usage on:

  • PR Approval Time
  • PR Size (diff size)
  • PR Code Coverage
  • PR Code Smells

We chose these specific metrics mainly because PR data is typically very complete, high-quality, and highly indicative of how engineering organizations operate. If there’s no negative outcome in code quality metrics, that is highly reassuring for widespread usage.

Feature engineering and capturing confounders

To maintain a robust and ongoing double machine learning causal analysis, it is essential to continuously capture all relevant inputs that might affect the outcome or treatment. Excluding critical confounders (or including variables that are effects of your treatment) can lead to overestimating or underestimating the real effect of Copilot. 

Our automated machine learning workflow leverages a library called Featuretools to periodically generate features (variables) for data within the Faros AI standard schema. As Faros AI continually ingests data from a range of engineering tools and normalizes it, we've established a general approach to Copilot analysis that 1) applies universally across all our customers without custom feature engineering and 2) provides a comprehensive set of features for our analysis.
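For illustration, a minimal Featuretools sketch over a hypothetical two-table schema (authors and pull requests) might look like this; the real Faros AI schema, primitives, and feature set are far richer:

```python
# Illustrative Featuretools sketch over a hypothetical authors -> pull requests schema.
import featuretools as ft
import pandas as pd

authors = pd.DataFrame({
    "author_id": ["alice", "bob"],
    "seniority_years": [7, 1],
})
prs = pd.DataFrame({
    "pr_id": [1, 2, 3],
    "author_id": ["alice", "alice", "bob"],
    "diff_size": [120, 40, 900],
    "created_at": pd.to_datetime(["2025-01-02", "2025-01-05", "2025-01-06"]),
})

es = ft.EntitySet(id="engineering")
es.add_dataframe(dataframe_name="authors", dataframe=authors, index="author_id")
es.add_dataframe(dataframe_name="prs", dataframe=prs, index="pr_id", time_index="created_at")
es.add_relationship("authors", "author_id", "prs", "author_id")

# Deep Feature Synthesis: per-PR columns plus aggregates rolled up through the author.
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="prs",
    agg_primitives=["mean", "count"],
    trans_primitives=[],
)
print(feature_matrix.columns.tolist())
```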

Not all Faros AI customers immediately integrate the complete range of engineering data sources (such as calendar information or deployment data) alongside common sources like version control (e.g., GitHub), ticket tracking (e.g., Jira), or incidents (e.g., PagerDuty). Our feature definitions are therefore made robust against missing information, so analyses remain insightful even when some confounding variables are absent. This can occasionally lead to overestimating the effects of Copilot, but these estimates are still more accurate than looking at the completely uncorrected data.

The features provide an ongoing comprehensive view of activities surrounding the creation and review of each PR, accounting for everything from authors and reviewers to repository and team information.

Simplified entity relationship diagram (ERD) for ML features generated using the Faros AI standard schema

These features were meticulously curated over repeated analyses to ensure none are downstream effects of Copilot usage itself. When you include downstream effects (often called “bad controls”), you can distort the results and underestimate Copilot’s true effect. 

We ran several sensitivity analyses across all customers, examining feature importance and effect size to remove any features that, upon reflection, were likely downstream effects. 

For example, we removed incident assignments to pull request authors in the post "Ready for review" period from the feature set because they tended to be suspiciously predictive of code coverage, likely indicating that if you don't test your code properly and it causes an incident, you will be the one assigned to fix it. 

Model selection and hyperparameter tuning

With a validated workflow for feature creation in place, our model selection process rigorously identifies the best architecture for each model (outcome, treatment, and final residual correlation) across our customers' data. 

Using scikit-optimize for Bayesian optimization, we re-tune hyperparameters for, and select among, a variety of scikit-learn tree-based models (e.g., RandomForest, HistGradientBoosting, and ExtraTrees) every few months. This ensures that we are optimizing model selection for each customer's evolving dataset within the double machine learning empirical assessments.

All of the models we evaluate in this model selection step are tree-based. Tree-based models are particularly well suited to double machine learning applications because they naturally capture nonlinear relationships and interactions among variables (such as “Engineer with 3 years of experience + Java usage + a busy schedule may behave differently than a brand-new engineer working in Python,” etc.). This allows the models to capture the complexity of the confounding variables interactions without needing to explicitly define the relationships between them.
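As a hedged sketch, a Bayesian hyperparameter search over one tree-based candidate with scikit-optimize could look like the following; the search space, scoring metric, and data are illustrative, not our actual per-customer configuration:

```python
# Illustrative Bayesian hyperparameter search with scikit-optimize over a
# RandomForest candidate; repeat for each candidate architecture and model role.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from skopt import BayesSearchCV
from skopt.space import Integer, Real

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

search = BayesSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    search_spaces={
        "n_estimators": Integer(100, 400),
        "max_depth": Integer(3, 20),
        "min_samples_leaf": Integer(1, 20),
        "max_features": Real(0.3, 1.0),
    },
    n_iter=20,
    cv=5,
    scoring="r2",
    random_state=0,
)
search.fit(X, y)   # run once for the outcome model and once for the treatment model
print(search.best_params_, round(search.best_score_, 3))
```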

After models and hyperparameters are established, the EconML non-parametric double machine learning model is applied to each organization's data. Predictions, feature importance, and final outcomes undergo rigorous verification, including repeated sensitivity analyses and model quality ratings. Models that do not meet established quality standards for a customer (determined by the R-score metric) are systematically excluded from causal effect calculations.
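For illustration, fitting an EconML non-parametric DML estimator on synthetic data might look like the sketch below; the nuisance and final models shown are placeholders rather than the per-customer models selected in the step above:

```python
# Illustrative EconML NonParamDML fit on synthetic data (effect size ~0.3 by construction).
import numpy as np
from econml.dml import NonParamDML
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 4_000
W = rng.normal(size=(n, 5))                       # confounders (seniority, workload, ...)
X = W[:, :2]                                      # features the effect may vary with
T = W[:, 0] + rng.normal(size=n)                  # treatment: Copilot accesses (proxy)
Y = 2.0 * W[:, 1] + 0.3 * T + rng.normal(size=n)  # outcome: e.g., PR approval time

est = NonParamDML(
    model_y=RandomForestRegressor(n_estimators=200, min_samples_leaf=10),
    model_t=RandomForestRegressor(n_estimators=200, min_samples_leaf=10),
    model_final=RandomForestRegressor(n_estimators=200, min_samples_leaf=10),
    cv=5,
)
est.fit(Y, T, X=X, W=W[:, 2:])
print(est.ate(X))   # average effect of one additional unit of treatment (~0.3)
```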

Causal analysis results - Does GitHub Copilot improve code quality in PRs?

After examining the results across ten companies using Faros AI to navigate GitHub Copilot adoption and optimization, we found that, overall, Copilot usage has no significant negative effects at the organizational level.

In other words:

  • PR Approval Time was not adversely affected by Copilot usage.
  • PR Size did not experience detrimental changes.
  • Code Test Coverage was not decreased.
  • Code Smells did not increase.

These important findings indicate that, overall, engineers can continue using Copilot without worrying about a decline in key code quality indicators. Engineers themselves widely report that Copilot helps them move faster, and from a broader organizational perspective, this doesn’t appear to happen at the expense of code quality.

That’s it? Are we done? No, because that result was the overall average.

Some teams saw a positive impact on quality, and others saw a negative impact.

Team coding practices are a major factor in AI-generated code’s quality

While overall there was no significant negative impact on PR quality metrics (a very encouraging finding), individual teams within organizations did show varying levels of effects of Copilot usage on these metrics. 

For example, some teams' PR Code Coverage decreased, while for others, PR Diff Sizes increased.  

That is where a team-by-team analysis is paramount. We all know that one bad outcome in a highly sensitive area of the code base can reverberate across the entire organization, undermining confidence in AI even when that reaction isn’t fully justified. Freezes, moratoriums, and limitations on Copilot usage may follow in an effort to prevent similar incidents.


What can you do? 

As soon as adoption begins, every engineering manager should consider their team’s specific causal analysis and improve key processes around code review and code quality to mitigate any negative impact. 

It is always prudent to consider the team’s context and adapt accordingly:

  • Are teams tackling exceptionally difficult or domain-specific issues?
  • Are new developers relying too heavily on AI-suggested code and missing some deeper design best practices?
  • Does the team have good testing and review practices in place? How does introducing Copilot affect these practices?

Conclusion: Reassuring news for Copilot adoption

Overall, it is reassuring to discover that engineers can use Copilot without creating a nightmare for the next person to maintain the code. In large engineering organizations, this is a major concern when adopting new tools: “Will it scale? Will it degrade code quality?” 

This causal analysis research, powered by double (debiased) machine learning techniques, strongly suggests that, at this point in time, Copilot is overall safe from a code-quality perspective. 

If your organization has flexible budgets, investing in Copilot licenses can accelerate development speed while maintaining quality metrics like code coverage and PR review time. Combined with ongoing qualitative feedback from engineers, this is a promising result for broad adoption.

However, it's important to recognize that with evolving technology and practices, the impact on code quality could change. Thus, using a tool like Faros AI to monitor code quality remains essential. Currently, Faros AI determines that using coding assistants does not detrimentally impact code quality, but this could shift with wider adoption, increased reliance on technology, or technological evolution.

Furthermore, any teams with evident negative outcomes could greatly benefit from investigating how Copilot is being used. Are individuals ignoring lint or code smell warnings? Are they relying heavily on generic snippets that clash with architectural constraints? Analyzing these outliers can help refine best practices, allowing teams to implement necessary process changes and share successful strategies to mitigate issues.

For a conversation with a Copilot adoption expert, request a meeting with Faros AI. 
