What’s the difference between the pull and push options, and which approach may work best for your data source?
Business intelligence platforms, particularly those targeting the software engineering space, play a crucial role in centralizing data from many sources to support business operations. These platforms provide teams and leaders with a holistic view of their software development processes, enabling them to make data-driven decisions, identify bottlenecks, and optimize workflows.
To achieve this, these platforms combine data from multiple types of software development systems, including source code management, project management, release management, incident management, and more. SaaS software engineering intelligence platforms like Faros AI must also support the ingestion of data from multiple flavors of those sources, whether they be cloud-based or self-hosted.
The process for getting data from a source to a BI platform often depends on the source, but it can largely be summarized into two options: a data connector that pulls the data from the source into the platform, or a webhook built into the source that pushes data to the platform.
To choose which approach works best for your source, let's first compare these two options.
Software development systems typically expose APIs that enable interested parties to request and retrieve data. These APIs are often protected by some form of credential system, such as a token. A connector is a piece of software that uses this credential to authenticate to the API to retrieve (“pull”) the data from the source system (“data source”) into the BI platform. This connector is run periodically to ensure the platform always has the most up-to-date data within a reasonable timeframe.
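To make the pull model concrete, here is a minimal sketch of a single scheduled connector run. It is an illustration under stated assumptions, not the Faros AI connector code: the `fetch_page` callable, the `updated_at` field, the page-token pagination, and the bearer-token header are all hypothetical stand-ins for whatever a given source API actually provides.

```python
from typing import Callable, Optional

# Hypothetical callable: (since, page_token) -> (records, next_page_token or None).
FetchPage = Callable[[str, Optional[str]], tuple[list[dict], Optional[str]]]


def auth_headers(token: str) -> dict[str, str]:
    """Build the credential header (assumes the source accepts bearer tokens)."""
    return {"Authorization": f"Bearer {token}"}


def run_connector(fetch_page: FetchPage, since: str) -> tuple[list[dict], str]:
    """One scheduled run: pull every record updated after `since`.

    Returns the records plus the cursor to pass to the next run.
    Assumes `updated_at` is a uniformly formatted ISO 8601 UTC string,
    so plain string comparison orders timestamps correctly.
    """
    records: list[dict] = []
    page_token: Optional[str] = None
    while True:
        batch, page_token = fetch_page(since, page_token)
        records.extend(batch)
        if page_token is None:  # the source reports no further pages
            break
    new_cursor = max((r["updated_at"] for r in records), default=since)
    return records, new_cursor
```

A scheduler would call `run_connector` periodically and persist the returned cursor, so each run only requests what changed since the previous one.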
This pull approach is the most common approach to ingesting data. Here are a few reasons why:
Some software development systems come with webhooks, which are internal components that can send data events to another party in real-time, or at least very close to real-time.
In this situation, the roles are reversed: The other party, such as a BI platform, exposes an API endpoint to receive data events. When an action takes place in the software development system, e.g. a new work task is created, the system "pushes" the event to the platform by making a request to the platform's API endpoint. This endpoint may also require a credential, which is supplied to the software development system when setting up the webhook.
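The credential check on the receiving side can be sketched as follows. This assumes the credential is a shared secret used to sign each request body with HMAC-SHA256, which is the scheme GitHub uses (it sends the result in the `X-Hub-Signature-256` header); other sources may use a different mechanism, and the function names here are hypothetical.

```python
import hashlib
import hmac
from typing import Optional


def sign(secret: bytes, body: bytes) -> str:
    """Compute the signature a sender would attach to this request body."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_authentic(secret: bytes, body: bytes, signature_header: Optional[str]) -> bool:
    """Return True only if the header matches the signature of the raw body."""
    if not signature_header:
        return False
    expected = sign(secret, body)
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of a forged signature happened to match.
    return hmac.compare_digest(expected.encode(), signature_header.encode())
```

The check has to run on the raw request bytes before any JSON parsing, because re-serializing the payload can change the bytes and invalidate the signature.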
Webhooks are an extremely useful tool and are commonly found in systems that are inherently event-driven, such as notification systems, automation tools, and e-commerce systems.
As a SaaS platform, Faros AI defaults to the pull approach for ingesting data. This means we develop, maintain, and run all the data connectors needed to generate the insights for our clients. But for us to run the connectors, our clients must supply us with the necessary credentials so that our infrastructure can authenticate to their software development systems. For some companies, providing system credentials to a third party is a non-starter. Perhaps they have compliance regulations that don't allow this behavior, or maybe the credentials cannot be scoped down enough to only allow the minimum set of permissions, or maybe they just don't want to do it.
For these situations, Faros offers a middle-ground option, which we call the "hybrid" approach. Our data connectors are open-source and available for anyone to download and run themselves. We can provide our clients with tailored instructions for running the connectors on their own infrastructure. This means they have full control over the operation and scheduling of the data connectors. However, full control also means full responsibility. The clients now have the added overhead of integrating the connectors into their automation stack along with the other engineering burdens of managing repeated jobs, and the time spent doing that can negatively impact other business operations.
Yet, for some clients, neither of these approaches may be ideal. But if their data sources include webhooks, they can now configure those webhooks to push their data events to Faros. This approach provides several advantages to the client:
The main drawback of webhooks is that, because they are event-driven, they do not support pushing historical data to another party, and platforms like Faros AI prefer to ingest months of historical data so they can quickly generate actionable insights for clients. To resolve this, Faros enables its clients to manually run the data connectors on their own infrastructure (the "hybrid" approach described above) just once to pull all the historical data into the platform, and then use webhooks to push new events into the platform as they are generated. Since clients run the data connectors only once, they don't have to deal with the added responsibilities of automation and management that running the connectors continuously would require.
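One practical detail of combining a one-time backfill with live webhooks is that the two streams can overlap: a webhook event for a record may arrive before the backfill reaches that same record. The sketch below shows one way a platform might reconcile them, assuming every record carries a stable `id` and a comparable `updated_at` value. Both field names and the in-memory `store` are hypothetical, and this is not a description of how Faros AI handles the overlap.

```python
def upsert(store: dict, record: dict) -> bool:
    """Insert or update a record by id, keeping whichever version is newer.

    Returns True if the store changed. Because the write is idempotent and
    order-independent, backfilled records and webhook events can arrive in
    any order, or more than once, without corrupting the final state.
    """
    current = store.get(record["id"])
    if current is not None and current["updated_at"] >= record["updated_at"]:
        return False  # the stored version is already as new or newer
    store[record["id"]] = record
    return True
```

With this shape, the webhook can be switched on before the backfill starts, which closes the window in which events could otherwise be missed.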
Several popular software development tools support webhooks, such as GitHub, GitLab, and Bitbucket for source code management, and Jira, Airtable, and Asana for task management. Popular incident management systems like PagerDuty and Opsgenie, which are already event-driven, support webhooks as well.
Since the Faros AI engineering team uses GitHub for both source code management and a portion of our CI/CD pipeline, we've set up our own GitHub organization to send events to our platform.
As our engineers push commits to their development branches, the GitHub webhook pushes corresponding commit events to the Faros platform. It also pushes events when:
Combined with the ingestion of our task management data, the platform now has a complete view of a feature's lifecycle, from being added to our task list to being deployed onto our platform.
In general, it is very easy to get started with webhooks on a system that supports them, like GitHub. This is because the system itself does all the heavy lifting. There is no need for the user to manage any GitHub tokens, schedule any job automations, or worry about performance-related details like rate-limiting or throttling. You can see the single web page that encompasses the entire setup process for GitHub webhooks.
If you're thinking about enhancing your own BI platform to support incoming webhook events, here are a few tips to ensure the best experience for your customers.
We mentioned earlier that the main drawback of webhooks is that they can't push historical data. This means that your platform must minimize the chance of missing any incoming events, because if you miss events, then someone needs to run a data connector to pull the missed data. Therefore, your event-handling service must be highly available and reliable. Some ways to achieve this include (but are not limited to) load balancing across multiple instances, deploying instances across multiple data centers or cloud regions, and configuring auto-scaling policies to add more instances during peak traffic times.
You may have noticed in the GitHub screenshot that we configured our own webhook to send all events to our platform — the "Send me everything" option. It's much faster to choose that option than pick and choose which event types to push, and if your customer is just looking to get something working quickly, this is probably the option they'll choose as well. Or, your customer's software tool may not allow them to choose which event types to send. This means your platform should handle events that don't have any relevance to your product. But to avoid these extra events impacting the performance of your platform, your event-handling service should identify and discard these extra events as early as possible, ideally before the event gets into any sort of processing queue.
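A minimal sketch of this early filter is shown below. It assumes GitHub as the source, which names the event type in the `X-GitHub-Event` request header, so irrelevant deliveries can be dropped without parsing the body. The set of relevant types and the `enqueue` callable are hypothetical; a real platform would choose its own.

```python
from typing import Callable, Mapping

# Hypothetical: the event types this platform actually uses.
RELEVANT_EVENT_TYPES = frozenset({"push", "pull_request", "deployment"})


def accept_event(
    headers: Mapping[str, str],
    body: bytes,
    enqueue: Callable[[tuple[str, bytes]], None],
) -> bool:
    """Enqueue relevant events and discard everything else.

    Returns True if the event was queued. The decision uses only the
    headers, so discarded events cost no JSON parsing and no queue space.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    event_type = normalized.get("x-github-event")
    if event_type not in RELEVANT_EVENT_TYPES:
        return False
    enqueue((event_type, body))
    return True
```

The endpoint would still respond with a success status for discarded events, so the sender does not treat them as failed deliveries.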
Even if your event-handling service has 100% uptime, there's still a possibility that some other component of your platform may have an outage that prevents an event from being fully processed. In these situations, your event-handling service should identify these errors as recoverable, and keep attempting to process the event until it succeeds. If you cannot retry indefinitely, have a backup storage system in place to store events so that when your platform issues are resolved, you can replay those errored events and get them into your platform.
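The retry-then-store pattern described above could look something like this. It is a sketch under assumptions: `RecoverableError`, the exponential backoff schedule, and the plain list standing in for a durable backup store are all hypothetical choices, not a prescribed design.

```python
import time
from typing import Any, Callable


class RecoverableError(Exception):
    """Raised by a processor when a downstream dependency is temporarily down."""


def process_with_retry(
    event: Any,
    process: Callable[[Any], None],
    dead_letters: list,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to process an event, backing off exponentially between attempts.

    If every attempt fails, the event is saved to `dead_letters` for later
    replay instead of being lost. Returns True if processing succeeded.
    """
    for attempt in range(max_attempts):
        try:
            process(event)
            return True
        except RecoverableError:
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    dead_letters.append(event)
    return False


def replay(dead_letters: list, process: Callable[[Any], None]) -> list:
    """Re-process stored events; return the ones that still fail."""
    remaining = []
    for event in dead_letters:
        try:
            process(event)
        except RecoverableError:
            remaining.append(event)
    return remaining
```

Once the outage is resolved, calling `replay` on the stored events drains the backlog, and anything it returns can be kept for the next attempt.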
In summary, while APIs and data connectors are the standard way of ingesting data into BI platforms, webhooks can provide immense value in the right circumstances. For companies that can't share credentials or want real-time data flows, webhooks are an elegant solution that puts control firmly in their hands. With high availability, validation, and error handling, BI platforms can fully leverage webhooks to deliver responsive insights.
If you're currently evaluating strategies to centralize data into a BI platform for software engineering, read more about Faros AI here.