GitHub Copilot is one of the fastest-adopted tools in the history of software development. One year after its release, over 1 million developers and 20,000 organizations are using it. But how do you measure its impact on your engineering operations? Read on.
Generative AI has been taking the world by storm over the past year. These AI models can generate new content, such as images, text, and even videos or music, that closely resembles human creations. This technology has immense potential and is already having a deep impact across various domains. In the field of art and design, it has been used to create stunning artwork and realistic graphics. In healthcare, generative AI is assisting in drug discovery by creating models of proteins. In education, chatbots are acting as tutors. In sports, as coaches.
The impact on business is expected to be massive, unlocking new opportunities for growth and innovation. So much of what knowledge workers do is about creating content of different forms. Think product descriptions, blog posts (like this one!), marketing campaigns, knowledge articles, or even product designs, logos, branding, pitch decks, and entire websites! Leveraging their proprietary data, organizations are rolling out much more powerful chatbots for customer service or internal use. Every knowledge worker is essentially getting a digital coach/assistant that can be trained and fine-tuned for the task at hand.
As a new technology, however, generative AI still has plenty of issues, compounded by the fact that it was essentially released to everyone, for free, and very quickly. There are plenty of examples of inaccurate, biased, or harmful content being generated. Open questions remain around copyright infringement, as these models were trained on content from the internet. And the impact on jobs is hotly debated.
To maximize impact and reduce risk, it is critical for organizations rolling out generative AI capabilities in their products and teams to understand the potential and limitations of the technology, follow its (rapid) progress, and provide attentive human oversight by tracking its impact on key business metrics.
One of the most exciting applications of generative AI is in software development.
Almost exactly a year ago, GitHub Copilot was released, built on top of OpenAI's Codex model, a descendant of GPT-3. Trained on huge amounts of code from public repositories, it can write entire blocks of code and help with quintessential software development tasks such as debugging, refactoring, and writing tests or documentation.
What makes GitHub Copilot so powerful is its deep integration into a developer's environment. It provides AI-based code completions that a developer accepts by pressing the Tab key, a great example of a successful generative AI integration via a constrained UX within an existing workflow. Today, GitHub Copilot has been activated by more than one million developers in over 20,000 organizations, generating a staggering three billion accepted lines of code, according to a recent post by GitHub.
Despite its skyrocketing popularity and undeniable benefits, GitHub Copilot, like similar generative AI tools such as Tabnine, Amazon CodeWhisperer, Replit Ghostwriter, or FauxPilot, has its limitations and should be rolled out with care.
For one, it might generate code that is sub-optimal or even downright buggy. Or the code could run fine but not produce the expected output or follow product requirements. Code quality can vary, and security or compliance issues can be introduced. Developers leaning on the tool excessively might not understand the code well enough to debug it or answer questions during code review, lengthening the review process and making the code harder to maintain. Generated code could also potentially have copyright or plagiarism issues.
Tools like GitHub Copilot are most likely already being used by engineers in your organization, or will be very soon. You cannot ignore them: the efficiency gains are huge, and teams using them will have an edge. But as we just saw, failing to manage their rollout and usage could create major headaches for your organization.
To roll them out properly, what matters most is good visibility into the whole of your software development life cycle.
For example, with its smart autocomplete capabilities, GitHub Copilot can increase coding velocity, reducing the time it takes a developer to submit a PR. But what if the code review takes twice as long because the developer cannot answer the reviewer's questions?
Or you might be shipping code and closing tickets faster, but spending more time maintaining or debugging it, with an uptick in incidents.
Developer satisfaction may improve by removing some of the tedious tasks such as writing unit tests or documentation, but may also be negatively impacted by the increased time spent reviewing larger PRs with sub-optimal code or testing for security flaws and compliance issues.
As you can see from these examples, it is easy to get the wrong picture if you only focus on a narrow set of metrics. It may seem like your velocity is improving because more tickets are closed and time in dev is being reduced, but lead time to production may actually increase with lengthier PR reviews. Velocity could increase while quality is negatively impacted. Developer satisfaction may initially improve from shipping code faster, then decrease from having to maintain code that is sub-optimal or plagued by security flaws.
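To make this concrete, here is a minimal sketch with purely hypothetical stage durations, showing how a real drop in dev time can still translate into slower end-to-end delivery once the review stage swells:

```python
# Purely hypothetical per-stage durations (in days) for an average change,
# before and after adopting an AI coding assistant.
before = {"dev": 2.0, "first_review": 1.0, "merge": 0.5, "qa": 1.5}
after = {"dev": 1.0, "first_review": 2.5, "merge": 0.5, "qa": 1.5}

def lead_time(stages):
    """End-to-end lead time is the sum of the per-stage durations."""
    return sum(stages.values())

print(f"Lead time before: {lead_time(before):.1f} days")  # 5.0 days
print(f"Lead time after:  {lead_time(after):.1f} days")   # 5.5 days
# Dev time was halved, yet the change ships later: the bottleneck moved
# from writing the code to reviewing it.
```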
To get an accurate view of the impact, benefits, and unintended consequences of rolling out tools like GitHub Copilot in your organization, you need full visibility across your entire software development lifecycle. Fortunately, you can leverage existing frameworks and tools to do just that.
DORA metrics will help you keep an eye on BOTH velocity and quality. Monitoring Lead Time, rather than Ticket Cycle Time, is a much better way to measure actual improvement in what matters: delivering code to your customers in production. And an increase in Change Failure Rate is a red flag that there might be an issue with auto-generated code. Engineering productivity should be carefully analyzed and not reduced to the number of tickets closed in a sprint: pull request merge rates, planned vs. unplanned work, and team health, among others, should all be taken into account.
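As a rough illustration of what these metrics involve, here is a minimal sketch computing two of them, Lead Time for Changes and Change Failure Rate, over a handful of invented deployment records; in practice this data would come from your version control and CI/CD systems:

```python
from datetime import datetime
from statistics import median

# Hypothetical deployment records: each deploy carries the commit
# timestamp of the change it ships and whether it caused a failure.
deployments = [
    {"committed": datetime(2023, 6, 1, 9),  "deployed": datetime(2023, 6, 2, 17), "failed": False},
    {"committed": datetime(2023, 6, 3, 11), "deployed": datetime(2023, 6, 7, 10), "failed": True},
    {"committed": datetime(2023, 6, 5, 14), "deployed": datetime(2023, 6, 6, 9),  "failed": False},
]

# Lead Time for Changes: time from commit to running in production.
lead_times_hours = [
    (d["deployed"] - d["committed"]).total_seconds() / 3600 for d in deployments
]
print(f"Median lead time: {median(lead_times_hours):.1f} hours")

# Change Failure Rate: share of deployments that caused a failure.
cfr = sum(d["failed"] for d in deployments) / len(deployments)
print(f"Change failure rate: {cfr:.0%}")
```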
At Faros AI, we work with some of the largest organizations in the world, like Salesforce, Box and Coursera. Many of them are rolling out tools like GitHub Copilot with a mix of excitement and concerns. With teams of thousands or even tens of thousands of engineers, the stakes are high.
Faros AI provides a “single-pane” view across a software engineering team’s work, goals and velocity. You can connect key data sources to the platform (Jira, GitHub and many others) and leverage out-of-the-box modules such as our DORA metrics solution or customize and build your own analytics.
It is the perfect tool for these large organizations to monitor the impact of rolling out tools like GitHub Copilot, and we ran an initial study with a subset of our customers to get some early signals.
For this first study, we proceeded in two steps: we interviewed developers using GitHub Copilot, then used Faros AI's DORA module to explore metrics for teams using the tool more heavily, to see whether key delivery metrics were impacted.
The first learning from this study is that some organizations did not really have a good sense of how deeply tools like GitHub Copilot had actually penetrated their organizations. Usage had grown organically and somewhat below the radar. Some groups were heavy users, while others were not using the tools at all.
In terms of actual usage, the key way Copilot is used today is for code autocomplete. Most developers we talked to praised that functionality and were heavy users. Key use cases mentioned were writing boilerplate code, skeleton code, code comments, and tests. All of these amount to micro-savings, basically saved keystrokes, but they accumulate throughout the day, and developers we talked to cited productivity gains upwards of 20% on coding work from this alone.
In terms of code suggestions, opinions varied. Some developers complained about them being too noisy or chatty, although they noted recent improvements over what they had experienced a few months ago. The hit rate was deemed low (~25%), especially on more complex code, but suggestions could sometimes be helpful as a starting point. For this task, another tool was actually preferred: ChatGPT itself. Several developers we talked to actually used it even more than Copilot. Common examples given included generating code snippets from specs, translating from one programming language to another (for programmers starting on new languages), generating similar pieces of code as an alternative to writing a script for tasks such as search and replace, and acting as a tutor for debugging. Some developers cited time savings of over an hour per day leveraging ChatGPT in this way.
A key theme throughout these interviews was that developers don't really trust these solutions yet, describing them as "a junior assistant that is very energetic but often wrong." All of them indicated they use the tools on small chunks of code so they can verify the output, as errors were expected and would be harder to find if too much code were generated at once. The main concern expressed was around introducing quality issues in edge-case scenarios. While we talked mostly to senior developers, concerns were expressed about the potential impact of these tools in the hands of more novice programmers, who might lack the experience to spot such issues.
The next step was looking at the data. Once information was collected on which teams were using the tools more heavily, it was easy to use Faros to conduct an A/B analysis and a before/after comparison, since the DORA metrics can be filtered down to the team level and charted over time.
When doing so, we observed, at this point in time and with a limited sample, that overall velocity for teams using Copilot was not significantly different from that of teams not using it, and had not changed much before and after adoption. Diving deeper was even more interesting and started to explain why: since Faros provides a breakdown of the lead-time-for-changes stages, it was clear that the biggest bottlenecks were often in the First Review Time, Merge Time, and Time in QA parts of the cycle. In other words, potential gains in dev time were dwarfed by time spent in other stages of the pipeline, and as a result lead time to production, which is what really matters, barely moved. This in itself was a powerful insight, and several of our customers implemented PR review policy changes as a result.
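For readers who want to run a similar comparison on their own data, here is a minimal sketch of the cohort analysis; the data shape, cohort labels, and numbers are all hypothetical (in our study, the equivalent views came from the Faros platform):

```python
from collections import defaultdict
from statistics import median

# Hypothetical per-PR stage durations in hours, tagged by team cohort.
prs = [
    {"cohort": "copilot",    "dev": 10, "first_review": 30, "merge": 6, "qa": 20},
    {"cohort": "copilot",    "dev": 8,  "first_review": 42, "merge": 5, "qa": 18},
    {"cohort": "no_copilot", "dev": 16, "first_review": 28, "merge": 6, "qa": 19},
    {"cohort": "no_copilot", "dev": 14, "first_review": 33, "merge": 7, "qa": 21},
]

stages = ["dev", "first_review", "merge", "qa"]
by_cohort = defaultdict(list)
for pr in prs:
    by_cohort[pr["cohort"]].append(pr)

# Compare the median per-stage breakdown and total lead time per cohort.
for cohort, rows in by_cohort.items():
    breakdown = {s: median(r[s] for r in rows) for s in stages}
    total = sum(breakdown.values())
    print(f"{cohort}: {breakdown} -> total lead time: {total}h")
# A pattern like the one above (dev time down, review time up, similar
# totals) points at the review stage, not coding, as the bottleneck.
```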
Our second study will look at additional aspects, including quality and productivity metrics, and we look forward to sharing those results with you soon!
Generative AI is reshaping the business landscape. Tools like GitHub Copilot are most likely already being used in your organization, or soon will be, and you cannot ignore them. Efficiency gains can be huge and give your teams an edge. That being said, to roll them out properly and reap the benefits while addressing the issues, you need good visibility into the WHOLE of your software development life cycle. Tools like Faros AI can give you this visibility. The time is now.
Request a Demo and we will be happy to set up time to walk you through the latest advancements in our platform.
Global enterprises trust Faros AI to accelerate their engineering operations. Give us 30 minutes of your time and see it for yourself.