In the fast-paced world of early-stage company building, making the right choices in application and deployment architecture is crucial. We’ve seen firsthand how choosing the wrong infrastructure patterns from the get-go can cost startups significant money and, more importantly, significant time. This is a story we often hear: a team begins with a platform such as Google Cloud Run for its purported ease of use, only to later discover that it lacks the customization needed for specific tasks. Below, we’ll explore a real-life example from a customer of ours who transitioned away from Cloud Run after running into some of these challenges.
Why Startups Choose Cloud Run
What is Cloud Run? Cloud Run offers a streamlined way to deploy containers on Google Cloud, combining the benefits of serverless computing with containerization. It enables you to quickly and easily deploy scalable, stateless HTTP containers. With Cloud Run, the provisioning and configuration of servers are fully automated by Google Cloud, including auto-scaling based on traffic demands. This means it can scale down to zero, ensuring you only incur costs when the service is actively used.
While there are many reasons a startup might choose Cloud Run to get off the ground, we’ve found these two to be the primary drivers of the decision:
Cost-Effectiveness: As an early startup with limited traffic or compute, you only pay for services when they are in use.
Fully Managed Knative Service: Cloud Run abstracts infrastructure management, similar to Kapstan's approach of simplifying DevOps by automating deployment and scaling processes.
Although Cloud Run has significant advantages, it also comes with limitations that may affect its suitability for certain use cases. The biggest issue we’ve seen is the lack of customization, which can lead to months of reconfiguration costs down the road.
GrowthFactor: Real-Life Example
GrowthFactor, one of our earliest customers, initially chose Cloud Run for its serverless, cost-effective nature. However, GrowthFactor was focused on delivering complex, data-rich applications. While they didn’t realize it at the start, over time they came to require robust task-management capabilities. Specifically, they required Celery support for handling the long-running, persistent tasks that are integral to their operations. Celery, a distributed task queue, is designed to handle asynchronous tasks and manage work efficiently across multiple worker nodes. However, Google Cloud Run's stateless, request-driven model posed significant challenges:
- Stateless Limitations: Cloud Run is optimized for stateless applications, which means it automatically scales down to zero when not in use. This feature, while cost-effective, is not conducive to maintaining persistent connections required by Celery for long-running tasks. The lack of state persistence makes it difficult to manage dedicated worker pools that need to maintain a constant connection to a message broker like Redis.
- Complexity in Configuration: Managing Celery in a Cloud Run environment requires complex workarounds to simulate statefulness, such as using external services for state management. This added complexity can lead to increased development time and potential reliability issues, as the infrastructure is not inherently designed to support such configurations.
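To see why scale-to-zero is at odds with this model, here is a minimal, hypothetical sketch of the pattern Celery workers implement: a long-lived process that keeps a persistent connection to a broker and continuously pulls tasks. We use the Python standard library's in-memory queue as a stand-in for a broker like Redis; all names here are illustrative, not Celery's API.

```python
import queue
import threading

# Stand-in for a message broker such as Redis; in production the
# broker is a separate, always-reachable service.
broker = queue.Queue()

def worker(results):
    # A Celery-style worker is a long-lived process: it must stay up
    # and connected to the broker to receive tasks. A platform that
    # scales idle containers down to zero would kill this loop.
    while True:
        task = broker.get()
        if task is None:          # sentinel: shut the worker down
            break
        results.append(task * 2)  # placeholder for real task logic
        broker.task_done()

results = []
t = threading.Thread(target=worker, args=(results,))
t.start()

for i in range(3):
    broker.put(i)      # enqueue tasks, e.g. from a web request handler
broker.put(None)       # tell the worker to stop
t.join()
print(results)         # [0, 2, 4]
```

The key point is the `while True` loop: the worker only receives tasks while its process exists. On a request-driven platform, nothing guarantees that process survives between requests, which is exactly the mismatch GrowthFactor ran into.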
Logs Management Challenges
Effective logs management is crucial for any startup aiming to maintain high service reliability and quickly troubleshoot issues. However, GrowthFactor encountered one major issue with Google Cloud Logging when using Cloud Run:
- Log Delays: In high-traffic scenarios, logs did not appear in real time, which is essential for diagnosing and resolving issues promptly. This delay hindered the ability to respond to incidents swiftly, potentially affecting service availability and customer satisfaction. Again, this was not a problem at the start, but months down the road it caused unnecessary frustration.
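One common mitigation (not specific to GrowthFactor's setup) is to emit structured JSON logs on stdout: Cloud Run forwards stdout to Cloud Logging, which recognizes keys such as "severity" in JSON-formatted lines, so delayed logs are at least indexed and filterable once they arrive. A minimal sketch, with the helper name and fields being our own:

```python
import json
import sys

def log_structured(severity, message, **fields):
    """Emit a one-line JSON log entry on stdout.

    Cloud Logging parses JSON lines from Cloud Run and maps the
    "severity" key to the entry's log level.
    """
    entry = {"severity": severity, "message": message, **fields}
    print(json.dumps(entry), file=sys.stdout)
    return entry

# Hypothetical usage from inside a request handler:
log_structured("ERROR", "celery task failed", task_id="hypothetical-123")
```

Structured logs don't fix ingestion delays, but they shorten the troubleshooting loop once logs land, which matters most in exactly the high-traffic scenarios where delays occur.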
These challenges highlight the importance of choosing a platform that aligns with the specific needs of your application architecture and operational requirements. For GrowthFactor, Cloud Run's limitations in handling stateful, long-running tasks and managing logs effectively prompted a reevaluation of their deployment strategy.
The moral of the story? We’ve observed a number of startups experience the pain of fragmented architectures as they transition away from a configuration-constrained platform. These architectures often involve a mix of Cloud Run, Cloud Functions, and Compute Engine instances. Such an architecture can slow down team velocity once it becomes an issue a few months down the road, and streamlining it requires significant time and effort before the team can get back to the main task: delivering features to customers.
Why Kapstan Chose to First Support GKE
Spoiler alert: Provisioning resources and deploying workloads onto GKE via Kapstan is easier and faster than using Cloud Run.
We chose to support GKE from the get-go so that we could offer startups an alternative path: one that offers more control, customization, flexibility, and integration with any (or all) third-party tools that may be required in the future. And most importantly, with none of the usual overhead that comes with day-2 Kubernetes management.
But… what about GKE cost?
Deploying two GKE clusters and their associated resources with Kapstan costs approximately $200 more per month than using Cloud Run. For venture-backed startups with access to cloud credits, this additional $200 per month is often a minor expense, especially when considering the overall ROI of utilizing GKE.
So, what does the ROI look like?
- Full Control and Customization: GKE allows for tailored configurations, giving startups the flexibility to optimize their infrastructure for specific needs.
- Flexibility: With GKE, you can handle diverse workloads and design systems that adapt as your business grows. More importantly, leveraging GKE from the get-go makes switching clouds (to utilize credits) or moving to a multi-cloud architecture very easy.
- Scalability: GKE’s robust scalability ensures your infrastructure can keep up with increasing demands.
- Zero Tech Debt: Say goodbye to fragmented cloud tools and the months of effort required to migrate down the road.
However, the true cost of GKE lies not in its price tag but in the time it demands. Managing infrastructure sprawl and handling day-2 operations can quickly pull focus away from what matters most: building products. For early-stage startups, time is often the most valuable resource.
Enter Kapstan. We built Kapstan to offer startups a seamless way to harness the power of GKE, with none of the overhead.
Provisioning resources and deploying workloads via Kapstan is easier than via Cloud Run. Every engineer on the team can spin up services, deploy, and monitor applications with ease in a single pane of glass. It’s so easy that your intern can do it. And with Kapstan, you get to harness the power of Kubernetes - with none of the headache.
Conclusion
If you are an early-stage, VC-backed startup, and if the following applies to you:
- Want to build leveraging a microservices and event-driven architecture
- Want multi-cloud support (make use of those cloud credits $$$)
- Want room for future, unknown, advanced configuration/customization (e.g., an app that will require persistent storage) based on future product pivots
- Want zero DevOps operational overhead, since your customers care about your product, not your helm charts
Then Kapstan is worth a look. Explore Kapstan today and see how it can transform your deployment strategy.