
What does a gold standard of performance look like - a demo of the Arré Voice app

· 4 min read
Aditya Kumar

In this post, I want to show the gold standard of how a mobile app should behave, especially on a fluctuating internet connection.

This is a demo of the Arré Voice app (https://www.arrevoice.com/). It's a social network built around short audio. If I had to oversimplify: it's Twitter for short audio. Check out the video below -

Click here to watch the video on Youtube

You can see how the app performs on a fluctuating 5G internet connection. In fact, coincidentally, the connection fluctuated between 4G and 5G partway through, and you can see how gracefully the app handled that. When I left Arré (in March 2024), the app had a 100% crash-free rate for the preceding two quarters on both Android and iOS. That is despite the app's primary target audience being Tier 2 and Tier 3 India, where mobile devices are usually not very high-end in terms of processor or memory; these are smartphones under Rs. 10K or 20K.

Even though the app uses multiple databases - a NoSQL database (DynamoDB serverless), a cache database (Redis), a graph database (AWS Neptune), and a vector database (Astra DB) - the performance you see in the video was achieved at less than $10 per day spent on databases, at a scale of 3K+ daily active users (250K+ total users, 70K+ MAU). The platform has a macroservices architecture that consistently serves 2000+ RPS across 12 services with a p99 latency of less than 100ms.

Performance is objectively measurable in tech!

Unlike your brand marketing strategy or an untested GTM strategy, which may or may not be measurable, the performance and stability of a tech product can be measured very objectively. For a mobile app, I define the gold standard of performance as -

  • You have to track the app launch time, and it should be as fast as possible. Arré voice app has less than a second of app launch time.

  • API response time, specifically P99 and P95 latency, should be as low as possible - Arré voice app has P99 of less than 100 milliseconds and P95 of less than 80 milliseconds for every feature on the app - including group messaging and analytics

  • Database query times should be as low as possible. Read latency should be less than ten milliseconds. (For a single query)

  • The crash-free rate should be as high as possible. The Arré voice app had a 100% crash-free rate on Android and iOS.

  • The number of bugs users encounter per month should be less than five.

  • The app should have the same performance at any scale for any feature.

  • The cost of all of the above should be as low as possible. If your infrastructure is over-provisioned, you may get a short-term gain in performance, but you are setting yourself up for failure when the actual scale hits. Also, you are just hurting your bottom line.

There are several open-source and closed-source tools for measuring all the above, and if you have a well-funded company and team, I can't think of a single reason not to do it.

I can go on and on, but you get the point

The scale here doesn't matter. In my previous organization, Leher App, we served a peak DAU of 250K+ with similar stats in terms of stability and performance. The point I am trying to make is this: it is possible.

If your app is performing any worse than this, you are just making a compromise. I achieved all this with a small team of 7 developers and zero QAs. And if you want a similar end-user experience and objectively good stats in speed and performance, use this link to schedule a call with me (click here).

Prefer async communication? - drop a Hi on Twitter or LinkedIn.

Avoid Bugs, Performance Issues, And Crashes in your mobile and web applications

· 12 min read
Aditya Kumar

Every time a bug, crash or performance issue appears in a tech product, it's attributed to a list of usual suspects like -

  1. Poor code quality
  2. Wrong selection of tools and technologies.
  3. Wrong selection of languages/frameworks/databases etc.
  4. Selecting or not selecting a "Microservices" architecture.
  5. Wrong selection of cloud architecture or deployment

But what if I told you that there is a simple prerequisite to building a good technology product that is the easiest to execute and is often overlooked?

It's writing an engineering requirements document!

Taking a documentation first approach is the most efficient way to avoid bugs, performance issues, crashes, technical debt and ensure a high-quality output.

This blog post will give you a template for defining your engineering requirements that has worked exceptionally well for me. It will help you structure your solution better. I have used it to describe and build several products that solve complex problem statements, so it's battle-tested.

I never had access to big engineering teams. I learned the hard way that the best way to avoid the problems mentioned above in the long run is to structure the product problem statements, the engineering requirements, and the What, Why and How of the solution well before writing a single line of code. This template will help you if you are an engineering manager or developer.

In one of my previous posts, I gave a PRD template for structuring your product requirements. Please check it out if you haven't already. Think of it as a precursor to planning your engineering work. (click here) If your product problem statement isn't defined well, this ERD template won't be as effective in solving your problems as you would ideally want it to be.

How to use this template?

Before I share the template, let me share a few pointers on how to use it. The idea is to avoid ambiguity while using this template.

  1. Define a single Engineering requirement document (ERD) for each product. If you are extending or changing the product, make changes in the same document. Don't create a separate doc for new features or modifications to the existing system. It will cause issues in the long run.

  2. Most tools where you write the ERDs have a versioning system that can fetch the document's historical version if needed. Please ensure your tool has this functionality, too. If not, do a manual backup, or if you are a developer, you can always use a Git repository to do versioning of the engineering documents.

  3. If your product depends on an existing product or system in your organization, mention the links, constraints and considerations of the older product in the new ERD.

  4. Write down the considerations you took into account for each part of your solution, especially if your solution is experimental. A template for such considerations can be - “I took this decision to do <X> like <Y> because of <Z> reason.” It will help you iterate on the solution better.

  5. Be as detailed as possible. The devil is always in the details. The template below describes what you are supposed to write at a high level. But it doesn't dictate what level of detail you can go up to. That's your choice. And trust me when I say this- the depth of your thought would count. The depth of your solution will contribute to the long-term success of your product.

If you follow the points mentioned above religiously, you will end up with an ERD document that efficiently solves 99% of the communication, scoping and product quality problems and will preemptively eliminate all the technical debt you could have incurred in the long term.

Engineering requirements template

*************** template starts here ********************

1. Product requirements

Write a high-level summary of the product requirements derived from PRD (product requirements document). If it's already present in the PRD, copy and paste it. Mention the link to the PRD.

Mention the link of the UI/UX file (Figma link, etc.), if any.

2. Functional requirements of the system

2.1 - Clarify ambiguity

  1. Read the product docs and understand each feature well.

  2. Think about all the edge cases from the product perspective and check if they have been properly explored in the product doc or not.

  3. If there are any dependencies on third-party tools and services, please ensure they have been explored properly in PRD. Classify these dependencies into one of the sections below (functional requirements/ considerations/ constraints/ engineering problems to solve) and detail them in that section.

  4. If there are any dependencies on existing systems (any existing internal services), please ensure they have been explored properly in PRD. Classify these dependencies into one of the sections below (functional requirements/ considerations/ constraints/ engineering problems to solve) and detail them in that section.

2.2 - Write each functional requirement clearly from the engineering perspective.

Engineering functional requirements essentially translate the product requirements into their tech counterparts. For example, if the product requirement is the “Login flow of LinkedIn”, the functional requirements will be the high-level description of all the APIs associated with that user journey, as in the sketch below. Please also write down the possible engineering edge and corner cases and how to handle them as requirements.
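
To make this concrete, here is a minimal sketch of what that translation could look like for a hypothetical login flow. The endpoint paths, payloads and edge cases below are illustrative assumptions rather than a prescribed design; the point is that each product-level journey becomes a small set of clearly described interfaces with their edge cases attached.

```typescript
// Hypothetical translation of a "Login flow" product requirement into
// engineering functional requirements. Paths and edge cases are illustrative.

interface EndpointSpec {
  method: "GET" | "POST";
  path: string;
  purpose: string;
  edgeCases: string[];
}

const loginFlowRequirements: EndpointSpec[] = [
  {
    method: "POST",
    path: "/v1/auth/otp/request",
    purpose: "Send a one-time password to the user's phone or email",
    edgeCases: ["rate-limit repeated requests", "unregistered identifier"],
  },
  {
    method: "POST",
    path: "/v1/auth/otp/verify",
    purpose: "Verify the OTP and issue access + refresh tokens",
    edgeCases: ["expired OTP", "too many failed attempts"],
  },
  {
    method: "POST",
    path: "/v1/auth/token/refresh",
    purpose: "Exchange a refresh token for a new access token",
    edgeCases: ["revoked refresh token", "token reuse detection"],
  },
];

console.log(loginFlowRequirements.length, "interfaces derived from one product journey");
```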

2.3- Define what is out of scope from the product and engineering perspective

Check the PRD for things that are out of scope from the product perspective. If the out-of-scope items haven't been written down yet, ask the Product/Engineering manager to add them. If they are already written, copy the link to that section of the PRD here, and then write down all the things that are out of scope from an engineering perspective.

These should be everything you factored out while designing the solution in this document.

2.4 - Write down the system considerations and constraints from an engineering perspective.

This section should create a list of -

  1. All the constraints of the engineering systems - Write down all the things this system won’t be able to do. Also, mention why it won’t be able to do that.

  2. Write down all the considerations you took into account from a non-functional and functional perspective for designing the solution later - The template of considerations is - I took this decision to do <X> like <Y> because of <Z> reason

2.5- Define any specific engineering problems to solve

This section should describe all the things that are not directly related to the product problems but are needed on the engineering side to meet the functional requirements mentioned above. It can be as simple as “We need a URL shortener for this” or as complex as “Our existing pub-sub system won't work in this case because of XYZ reason, so we need to come up with a different/better solution”.

3. Non-functional requirements of the system

  1. Write down all the non-functional requirements - examples are availability, capacity, reliability, scalability, security (high level), Maintainability + Manageability, etc.

  2. Discuss the load expected on the system with the product manager (check if it's already mentioned in the PRD or not) and mention that here

  3. Discuss any data-related non-functional requirements here - for example - data access speed, encryption, data consistency (strong/eventual)

4. Back-of-the-envelope estimate

4.1- Estimate load and explore it in detail -

  1. Requests per second/operations per second/ actions per second - write down the estimated figures.

  2. Database storage and cache storage estimates - Estimate the amount of the data stored (per minute/hour) and establish a read-write ratio.

  3. Media storage - read/write pattern and size of data. (A worked example of these estimates follows this list.)
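
As an illustration of the kind of arithmetic this section expects, here is a small worked sketch. Every number in it is a made-up assumption for demonstration purposes; substitute your own figures from the PRD and your analytics.

```typescript
// Hypothetical back-of-the-envelope estimate. Every number here is an assumed
// input for illustration, not a real product metric.

const dailyActiveUsers = 50_000;
const requestsPerUserPerDay = 200;   // API calls per DAU per day (assumed)
const peakToAverageRatio = 5;        // traffic is bursty (assumed)

const avgRps = (dailyActiveUsers * requestsPerUserPerDay) / 86_400;
const peakRps = avgRps * peakToAverageRatio;

const writesPerUserPerDay = 20;      // posts, likes, comments, etc. (assumed)
const avgRecordSizeBytes = 1_024;    // ~1 KB per record (assumed)
const dailyStorageGB =
  (dailyActiveUsers * writesPerUserPerDay * avgRecordSizeBytes) / 1e9;

const readWriteRatio =
  (requestsPerUserPerDay - writesPerUserPerDay) / writesPerUserPerDay;

console.log({ avgRps, peakRps, dailyStorageGB, readWriteRatio });
// => roughly 116 average RPS, ~580 peak RPS, ~1 GB/day of new records, 9:1 read/write
```

These rough numbers then feed directly into the read/write ratio, partitioning and caching decisions in the later sections.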

4.2- Synchronous vs Asynchronous communication

  1. Figure out which parts of the functional requirements need sync communication

  2. As a rule, for inter-service (inter-system) communication, try to lean on async communication as much as possible.

5. Define the interfaces - High-level definition

  1. Choose the proper interface according to the requirements and back-of-the-envelope estimate-

    1. REST API

    2. gRPC

    3. GraphQL

    4. Websockets

    5. SSE

  2. If you are choosing anything apart from REST API - mention the reason for that choice.

  3. Explain each of the interfaces clearly (a sketch follows this list) -

    1. Define the input parameter and type of the input parameter (query, body, URL, etc.)

    2. Define the output very clearly (what are the fields in the response, the response structure, etc.).

    3. This section should only describe the interfaces clearly. The detailed logic of it should be explained in a detailed section later.

  4. Factor in the end-user behaviour in these interfaces. We design the interfaces before writing data models to ensure we have thought about end-user access and write patterns. Our data models should reflect the same thinking (detailed in the next step)
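
Here is a minimal sketch of what such an interface definition could look like for a hypothetical feed endpoint. The transport choice, field names and response shape are assumptions for illustration; note that the detailed logic deliberately stays out of this section.

```typescript
// Hypothetical interface definition: inputs, outputs and transport are spelled
// out before any business logic is written.

// GET /v1/feed?cursor=<opaque cursor>  (REST chosen here purely for illustration)
interface GetFeedRequest {
  userId: string;        // derived from the auth token, not from the query string
  cursor?: string;       // opaque pagination cursor; page size is server-controlled
}

interface FeedItem {
  postId: string;
  authorId: string;
  audioUrl: string;      // CDN URL of the media file
  durationSec: number;
  createdAt: string;     // ISO-8601, UTC
}

interface GetFeedResponse {
  items: FeedItem[];
  nextCursor: string | null;  // null when there is nothing more to fetch
}
```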

6. Data models and low-level design of the system

6.1 - Schemas

  1. Define the different SQL and NoSQL schemas required for the functional requirements.

  2. Define the relationship between the schemas.

  3. Mention the data type and the corresponding validation checks like maximum length, what data is not allowed (for example - special characters), what level of precision is required (for example, time is stored in UTC nanoseconds), etc

6.2 - Explore the end user read/write patterns

  1. Write down the read/write patterns of end users for each of these schemas - Estimate this based on the expected user behaviour in product requirements.

  2. Define the partition keys, secondary indexes, etc., for each schema based on these patterns and the functional requirements (see the sketch after this list).

  3. Define if there are any secondary models (models that contain the same data stored in a manner which is optimised for the end-user experience) required for carrying out the functionality.

  4. Check if a single document/record of this schema has the potential to become very large. If yes, evaluate whether breaking the schema will be more practical.
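
A small sketch of what such a schema definition might look like for a hypothetical post entity, assuming a DynamoDB/Cassandra-style partition-key model. The field names, limits and the assumed read pattern are illustrative only.

```typescript
// Hypothetical schema for a "post" entity: data types, validation limits and
// access keys written down together, as this section suggests.

interface PostRecord {
  // Partition key: all of a user's posts live in one partition because the
  // dominant read pattern assumed here is "fetch a user's recent posts".
  userId: string;          // partition key
  postId: string;          // sort key; a ULID so records sort by creation time
  caption: string;         // max 280 chars, no control characters
  mediaId: string | null;  // reference into the media management system
  createdAtUtcNs: bigint;  // UTC, nanosecond precision
  softDeleted: boolean;
}

// Assumed secondary index: postId -> record, for direct permalink reads.

const validateCaption = (caption: string): boolean =>
  caption.length <= 280 && !/[\u0000-\u001f]/.test(caption);

console.log(validateCaption("A perfectly fine caption")); // true
```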

7. Architecture diagram with details of the tools and technologies used

7.1 - Case 1 - If it's an addition or extension to an existing functionality/system

  1. Mention the links to all documentation related to that existing functionality/system.

  2. Draw a diagram depicting how it will talk to the various components in the current system. The diagram should clearly show the flow of data across the entire system.

  3. Note any major changes/decisions made in this new system that can directly or indirectly affect the existing system.

  4. If there was a particular engineering problem to solve, note what solution you used to solve that problem.

7.2 - Case 2 - If it's a new functionality/system

  1. Draw a diagram depicting all the components and how they communicate. The diagram should clearly show the flow of data across the entire system.

  2. If there was a specific engineering problem to solve, note what solution you used to solve that problem.

  3. Mark the components shared across systems - databases, file systems, media storage, etc.

  4. Mention the tools and technologies used clearly in the diagram -

    1. Language/framework

    2. Databases

    3. Queue

    4. File systems

    5. Media storage

    6. Third-party tools

    7. Utility libraries

  5. If any of the above tools and technologies differ from the existing systems, mention the reason for choosing them.

8. Business logic of the interfaces

Detail the business logic of each interface, starting from the validation checks all the way to the point where the output is generated. This should be written in a line-by-line pseudo-code format, as in the example below. The idea is to have good clarity of logic before we start translating it into code. Send this logic to your peers/managers for review to cross-check whether you missed anything. If there are more than five interfaces, you may create sub-documents of this document and mention the links. This will ensure that this document stays cleaner and more readable.
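
For illustration, here is what that line-by-line logic could look like for a hypothetical "create post" interface, written as comments inside an empty function skeleton. It is a sketch of the format, not a real implementation.

```typescript
// Illustrative only: line-by-line business logic for a hypothetical
// "create post" interface, written before the real code.

async function createPost(/* input: CreatePostRequest */): Promise<void> {
  // 1. Check authentication and authorization first; reject early on failure.
  // 2. Validate the input: caption length, media ID format, disallowed characters.
  // 3. Verify that the referenced mediaId exists and belongs to the caller.
  // 4. Write the PostRecord to the primary store (idempotency key = postId).
  // 5. Publish a "post.created" event to the async pipeline (feed, analytics).
  // 6. Return the created post in the response shape defined in Section 5.
  // 7. On any failure after step 4, emit a compensating cleanup event.
}
```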

9. Client-side logic and interactions

9.1 - Compare the interfaces with the design file

Do a final check to ensure that the interfaces designed earlier cater to all the user journeys created in the design file. This is a redundant but essential check to ensure that you have not missed anything.

9.2 - Check whether all the possible error cases have a corresponding UI/UX in the file.

Your API documentation should detail all the possible error responses, and the design file should have a UI/UX element for all these cases. Check the design file for the same.

9.3 - Check whether the analytics events (and other tracking tools) are available in the product requirements.

Analytics and other tracking requirements are often missed while writing PRDs or ERDs. These events can be triggered on both the client side and the server side. Ensure that they are clearly described, including the event names, their properties, and when they are triggered.

9.4 - Detail the most critical client-side validations and security checks

In many cases, client-side validations and security checks play a crucial role in avoiding catastrophic security failures. If your use case has any of these, they should be mentioned in this section. Example - my application requires client certificate security for a socket connection (handshake).

10. Application and infra-level logging and alerts

In this section, answer the following questions -

  1. What things need to be logged at the application and infra level?

  2. What things need to be alerted at the application and infra level?

Your final application code should factor in the application-layer logging and alerts (a minimal sketch follows).
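
A minimal sketch of what application-level structured logging could look like, assuming a Node.js service that ships JSON logs to whatever observability stack you use. The event names and the alert threshold mentioned in the comment are hypothetical.

```typescript
// Minimal structured-logging sketch: one JSON object per line so that log
// shippers and alert rules have stable fields to filter on.

type LogLevel = "info" | "warn" | "error";

function logEvent(level: LogLevel, event: string, fields: Record<string, unknown>): void {
  console.log(
    JSON.stringify({
      ts: new Date().toISOString(),
      level,
      event,
      ...fields,
    }),
  );
}

// Example: an alert rule (assumed) could fire when "db.query.slow" events
// exceed a threshold per minute.
logEvent("warn", "db.query.slow", { queryName: "getFeed", durationMs: 180 });
```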

11. Single points of failure

Write a note about the single point of failure in the system. Discuss with the leads how critical these points of failure are and what steps we should take to ensure the disaster response and recovery process is fast.

12. Security considerations

  1. Write down the possible security failures in the solution described above

  2. Propose a solution for addressing them.

  3. Figure out places that will need specialised security.

13. API documentation

Mention the links to all the relevant API and other interface documentation.

*************** template ends here ********************

Ending note

I hope this template helps you as much as it has helped me. If you have any questions, doubts or suggestions, please feel free to contact me on Twitter, Linkedin or Instagram. Do share this post with your friends and colleagues.

Structure your product problem statements for faster time to market and good performance

· 7 min read
Aditya Kumar

Things get out of hand when working in a fast-moving startup or organization of any scale that believes in fast product iteration cycles or experimental iterations on products. If you are part of such a setup, you must have seen problems such as

  1. Miscommunication, misunderstandings, and misalignment among team members that often result in wasted time and effort.

  2. Product requirements keep changing or evolving beyond the initial scope, leading to delays, frustration, compromised product quality, etc.

  3. Inconsistent or poor quality end-user experience as all the possible edge and corner cases from a product perspective are not explored or solved correctly.

Many other problems arise when there is no method to the madness. At a high level, it all results in an output that solves neither the user's problems nor the business's problems well.

After building multiple large-scale products and profitable companies over the last decade, I have realized that a structured approach and sufficient planning go a long way in solving these problems. If one product owner is hell-bent on solving it by thinking deeply, they only need a way to structure the solution, and most problems disappear.

This blog post will give you a template for defining your product requirements that has worked well for me. It will help you structure your solution better. I have used it to describe and build several products that solve complex problem statements, so it's battle-tested.

Since I worked mostly with startups, I seldom had access to a "product management team", and I learned the hard way that the best way to avoid the problems mentioned earlier in the long run is to structure the product problem statement and the What, Why and How of the solution well before writing a single line of code.

This template will help you if you are a product manager, engineering manager or developer.

Before I share the template, let me share a few pointers on how to use it. The idea is to avoid ambiguity while using this template.

  1. Define a single product requirement document (PRD) for each product. If you are extending or changing the product, make changes in the same document. Don't create a separate doc for new features or modifications to the existing system. It will cause issues in the long run.

  2. Most tools where you write the PRDs have a versioning system that enables you to fetch the document's historical version if needed. Please ensure your tool has this functionality, too. If not, do a manual backup, or if you are a developer, you can always use a Git repository to do versioning of the product documents.

  3. If your product depends on an existing product or system in your organization, mention the links, constraints and considerations of the older product in the new PRD.

  4. Write down the considerations you took into account for each part of your solution, especially if your solution is experimental. A template for such considerations can be - “I took this decision to do <X> like <Y> because of <Z> reason.” It will help you iterate on the solution better.

  5. Be as detailed as possible. The devil is always in the details. The template below describes what you are supposed to write at a high level. But it doesn't dictate what level of detail you can go up to. That's your choice. And trust me when I say this: the depth of your thought would count. The depth of your solution will contribute to the long-term success of your product.

If you follow the points mentioned above religiously, you will end up with a PRD document that efficiently solves 99% of the communication, scoping and product quality problems.

Product requirements template

[Product Title]

Description: What is it?

A very high-level description of the product that can be sent to a non-product person or techie to understand quickly. Think of it as a pitch for the product/feature.

Problem: What problem is this solving?

Specify why we are building this product/feature. Define who is facing this problem. "Who" can be a user or another actor in the system. Define their persona.

Why: How do we know this is a real problem and it's worth solving?

Add the data or assumption that pointed you in this direction. If this was a requirement from another team/function, write that note here.

Success: How do we know if we've solved this problem?

How will you measure the success objectively? It can be done using metrics tracked in our analytics tool or data warehouse. Define those metrics in detail.

Audience: Who are we building for?

Define the persona (in detail if required). It will also help us better target the audience within the product and retention activities (like notifications, etc.)

What: What does this look like in the product?

  1. Key feature - describe the features in as much detail as possible.

  2. User journeys - should cover all possible states of the user journey

  3. Figma link

  4. Product copy for all the places

  5. Out of scope

  6. Future considerations - Optionally list features you are saving for later. These might inform how you build now.

  7. Constraints and dependencies - explore all the limitations and dependencies possible. Examples - can be tech constraints, constraints because of existing features on the product, etc.

  8. From a product perspective, list the edge/corner case scenarios and their solutions - for example, the user didn't do X but did Y. It resulted in a Z state in the product journey.

How: What is the experiment plan?

If this feature/product is an experiment, define how the experiment will be run. How will it be controlled (for example, started/stopped using a feature flag)?

When: When does it ship, and what are the milestones?

Divide the project into phases, each with milestones (or all stages can work towards a single metric/milestone). Add tentative dates next to them.

Open questions (optional)

Write down any questions that are on top of your mind while writing this doc or any of its versions. Write the questions in their raw form, with your thought process or assumptions. The idea is to revisit them later.

Non-visual product requirements (optional)

Explore all the non-visual product requirements needed to make this a success. Examples - push notifications, emails, WhatsApp messages, etc. Explore the product copy aspect of this as well.

Third-party tools

Mention if there are any dependencies on third-party tools and services (like Firebase, amplitude, etc.). Ensure they have been explored in depth in PRD.

Is seeding required in this product? (optional)

Explore whether any manual intervention from growth/content or any other teams is required in this product/feature.

Go to market (optional)

If this is a new product/feature being shipped and a specific GTM is required, then detail it in this section. Explore the risks and risk mitigation strategies if needed.

Create a launch checklist to ensure that everything is ready before going live.

Glossary

Define new or specific terminology in this document to remove ambiguity while communicating.

Ending note

I hope this template helps you as much as it has helped me. If you have any questions, doubts or suggestions, please feel free to contact me on Twitter, LinkedIn or Instagram. Do share this post with your friends and colleagues.

A brief history of Me

· 7 min read
Aditya Kumar

This post will give you a brief introduction to my journey so far. If you are reading this, I have probably forwarded it to you to save some time before we have a face-to-face conversation or a video call. I have intentionally kept it in bullet points to keep it crisp and no-nonsense. I would rather tell the anecdotes, experiences, and stories to you in person than have you read a bad attempt at an autobiography (Which I maintain is borderline narcissism unless you are a celebrity).

But let me give you the TLDR version first. In the last 14 years, I have-

  1. Worked on some of the most complex technology systems you can build - Media streaming, real-time communication, code compilation engines, high-frequency trading, virtual currencies and social networks.

  2. Co-founded a company and scaled it up to $2 Million in ARR without any venture capital.

  3. Served over 20 Million unique users through various systems I architected from scratch.

My work has always been at the intersection of Product, Technology, and Growth, so I will categorize each part of the journey along the same lines.

Jan 2023 to March 2024

Worked with Arré as the Head of Engineering, leading their product, design, and engineering teams. Arré is one of India's leading digital content and media-tech brands. It is home to Arré Studio, which produces and publishes original content with professional creators across genres, languages, formats, and platforms, reaching more than 300Mn people, and Arré Voice, a women-first short audio app aimed at building a new generation of creators.

Technology

Built a short-audio, women-first social network that served 250k+ users with a team of seven developers, three designers, and one product manager. The main product is a mobile app that works like an audio Twitter: people can educate or entertain their audience using 30-second audio recordings called voicepods. The app reached 250k+ total users and 70k+ MAU.

  • Created an algorithmic feed (a collaborative-filtering-based recommendation engine) using vector embeddings and a vector database (AstraDB) that directly led to a 30% increase in stream time. Created a social graph to supplement the algorithm (Neptune DB).
  • Created a macroservices architecture that consistently served 2000+ RPS across 12 services with a p99 latency of less than 100ms.
  • Created a Flutter app with 250K+ downloads and a 100% crash-free rate with a team of only two mobile developers.
  • Tech stack - TypeScript (NodeJs, ReactJS & NextJs), Golang, DynamoDB, Redis, Apache Cassandra, AWS Neptune (graph database), OpenSearch, Google BigQuery, Flutter, AWS Fargate, AWS API Gateway, Google Pub/Sub, and a vector database (AstraDB).

Product and Growth

Devised various product-led growth strategies such as -

  • Arré Match - A platform that combines the ideas of influencer marketing and performance marketing. Micro-influencers could come to the platform to promote small brands and, in return, earn money for each click/impression. So, instead of the traditional influencer marketing model where the influencer gets paid upfront, in this model brands pay the influencers only when they deliver some measurable value. We piloted this for six months and realized that, irrespective of product intervention, it would fundamentally be an operations-heavy business. We neither had the bandwidth nor the right resources to execute this, so we shut it down.
  • Voice Clubs - Since the core platform was interest and vernacular language first, I strategized and executed a community feature to increase cohort retention. The feature is still active on the app. It's the company's current GTM strategy, and the jury is still out on this.

Oct 2018 - Sep 2022

My official position at Leher was Principal Engineer, but I worked as a virtual co-founder, contributing ideas and execution across tech, product, and growth.

Technology

  • A mobile application that was a Clubhouse-like, audio-video-first social network (with all the features of a social network); it handled a peak DAU of 100K+ users with excellent speed and performance. A peak DAU of 250K+ was handled across the various products.
  • Created various web-based growth-hacking tools that reached 10Mn+ users within just four months of launch.
  • Created a push-notification-sending architecture capable of sending 10Mn+ notifications daily.
  • Created a Flutter app with ~2Mn downloads and a 99.7% crash-free rate with a team of only two mobile developers.
  • Architected payment systems for virtual currency and loyalty coins that processed more than one crore rupees' worth of transactions within eight months of launch.
  • Ran production environments on AWS, GCP, and Azure.
  • Tech stack - JavaScript (NodeJs, ReactJS & NextJs), Golang, MongoDB, Redis, Apache Cassandra, Elasticsearch, Google BigQuery, Flutter, Kubernetes, Serverless, Google Pub/Sub, and WebRTC.

Product and growth

Co-devised various product-led growth strategies such as -

  • Leher Lifafa - Created a monetization tool called Leher Lifafa that drove ~INR 3 Crores of GMV. It was a phenomenal growth hack for acquiring, in large numbers, communities running on Telegram. We acquired millions of users with zero spending on performance marketing.
  • Leher coins - Created a gamification strategy that led to 200% organic growth in Daily Active Users and in W1 and W4 retention.
  • Leher clubs - Leher became big around the launch of Clubhouse in the US. Leher clubs were quite similar to clubs in Clubhouse, with additional functionalities like video rooms, text chats, direct messaging, and video recording. This feature helped us increase our retention by 100%.

Aug 2015 - Oct 2018

Co-founded and scaled up an EduTech + HRTech startup called edwisor.com from a 2-person team to 100+ people and made ~$2Mn in annual revenue without any venture capital.

Technology

  • Built complex technology products such as live video streaming, code compilation engines (like HackerRank), gamified learning modules, a private Stack Overflow, etc.
  • Architected systems that served 10k+ daily active users with a small team of 6 people at a significantly low cost.
  • Built customized learning management systems and hiring solutions for the customers on the platform.
  • Tech stack - JavaScript (NodeJs, Angular), MongoDB, Redis, Kubernetes, Serverless, RabbitMQ, and WebRTC

Product and growth

As a co-founder, I was also directly responsible for product and growth. Devised and executed various strategies such as -

  • Inbound lead growth using Quora - We understood that one of the core behaviors of our target audience was posting questions and reading answers on Quora. We created a strategy and team to tap into this behavior that led to a 5X growth in the number of inbound leads and a 3X increase in revenues every quarter.
  • Growth of mentor pool using a tech tool - Mentors were the trainers on the EduTech Side of the platform. They were responsible for teaching, reviewing assignments, answering support queries on our custom-built community platform, etc. I was able to scale this operations-heavy business to a pool of 80+ mentors with a small team of 4 people and a little bit of gamification magic.
  • Led the content strategy for both web and data science career paths and strategy for partnering with companies that hired our graduates.

March 2010 - 2015

I sold my first application when I was 17. It was an ASP.NET(C#) application created for a local coaching centre. After that, I worked on twelve freelancing projects and four internships in various domains: Health Tech, Legal Tech, FinTech, Edu tech, HR tech, and Travel Tech. Not all the products became big businesses, but this phase of life turned me from a simple techie to an entrepreneur by heart, so it was worth mentioning in this post. After all, it's supposed to be a brief history of me.

Less than 10 minutes of reading time. Mission accomplished!

I hope this post serves its purpose. I have been asked, "What are you?" instead of "Who are you?" more times than I care to admit. I hope this post answers that question as well. I am looking forward to our meeting. If you haven't booked a call and came across this post directly, you can use this link to schedule a call with me.

Prefer async communication? - drop a Hi on Twitter or LinkedIn.

The Essential Checklist for Ensuring Basic Security in Your Systems

· 16 min read
Aditya Kumar


Cyber security is one of those things that organizations often overlook unless they have been hacked once or twice. In most cases, security breaches can be disastrous. Even if they don't lead to a complete shutdown of the organization, they can certainly tarnish its reputation in the eyes of its customers. You can find a list of some of the most notable hacks in recent history at this link.

The aim of this blog post is to give you a basic checklist of security practices that applies to companies of all sizes. If you have crossed off every item on the checklist, then you can be assured that a novice/amateur hacker won't be able to hack you.

I have tried my best to compile a basic checklist. But if you feel I have missed something, please ping me in the comments or on my Twitter or LinkedIn and I will add it to this blog post.

A small note on observability

Before we go into the detailed security checklist, it's important to discuss observability. You will notice that observability has been mentioned multiple times in each security category we discuss in the blog. The reason is simple: Regardless of how well you design or implement your security setup, there is always a chance of getting hacked. That is a reality you must accept and be prepared for. Your observability setup can be a huge difference-maker in stopping the attacks when they are happening and preemptively preparing for them. Hackers often prepare for large attacks by poking holes into your system and figuring out the attack vector. If you have followed the checklist below and used your observability set up to support your security measures, you will have a better chance against all current and future attacks.

If you don't know what observability is, please go through this link. I encourage you to read about it as much as possible. People often confuse observability with monitoring, but they are not exactly synonyms. Let me explain with a few analogies -

  1. Monitoring is like having a smoke detector, alerting you to potential fires, while observability is like having a full fire alarm system, pinpointing the exact location and cause of the fire.

  2. Monitoring is like watching a single gauge on your car's dashboard: it tells you if something is wrong (e.g., overheating) but not necessarily why. Observability is like having access to all the gauges and engine data: it provides a complete picture, allowing you to diagnose the root cause (e.g., a faulty fan or a coolant leak).

Therefore, while monitoring is a crucial aspect of observability, they are not interchangeable. Ideally, both work together to understand system health and performance comprehensively.

Various categories of security

For the sake of simplicity and better organization, we will divide the checklist into the following sections.

  1. Application layer security

  2. Infrastructure security

  3. Data privacy

  4. Physical security

  5. Backup and disaster recovery

In most organizations, these categories are owned by different teams/people. But if you are a startup developer, then it's all on you. Either way, I recommend you go through the entire list.

Application layer security

Years ago, there was a famous joke I used to hear - "If hackers can reach your application and database layer, then they deserve to hack your systems". The joke's premise was that infrastructure security should be strong enough to withstand any significant attack. But times and technology have changed, and attack mechanisms have become more sophisticated. From supply chain attacks to leaked credentials at the application layer, bad actors always find a way to reach your application layer. But you can do the following things to ensure you haven't made their life easy.

  1. Don't program endpoints that expose infinite/large amounts of data - whether you have REST APIs, GraphQL, WebSockets, gRPC, or something else, always ensure that you are not programming endpoints that, under any circumstances, allow infinite/large amounts of data to be queried easily. People often do this for internal tooling within the organization, which then becomes the attack vector through social engineering. Also, ensure you control page sizes (and other limits) from the backend and not as an input parameter, so that a bad actor cannot query too much data in a single request (see the sketch after this list).

  2. Strong authentication and authorization - Use proper authorization checks on all interfaces (REST APIs, GraphQL or whatever else you use). Ensure that these checks are the first set of checks in the programming logic. Never bypass your own system's authentication and authorization. People tend to do that for internal APIs/tools or other exceptional cases (the ones your manager asked for). However, a single interface like that can be the downfall of your entire system.

  3. Implement the principle of least privilege, granting users and services only the minimum permissions necessary to perform their tasks. Regularly review and update access controls to align with this principle. Do this for credentials of third-party services and cloud providers as well (IAM roles, secure tokens, authorization files etc).

  4. Don't hardcode credentials, and follow twelve factors as much as possible. Configuration-first applications are the future and secure by design. If you don't know what this means, please read this blog post to understand it better in the modern context.

  5. Perform strong validation checks of your input and output and do data sanitization as much as possible to ensure that people cannot execute old-school attacks like SQL injection.

  6. Secure communication and data practices—Enforce TLS on all your endpoints. Ensure that you encrypt mission-critical data, such as users' personal information, payment information, account information, etc. You can also use mutual authentication if you are building for mobile clients or building realtime communication using WebSockets or WebRTC. When it comes to mutual authentication, I prefer client-side certificates over the other approaches as they are relatively simple to maintain.

  7. Implement CSRF protection mechanisms, such as anti-CSRF tokens, to prevent malicious third-party sites from performing unauthorized actions on behalf of authenticated users.

  8. Implement relevant security headers in HTTP responses, such as Content Security Policy (CSP), Strict Transport Security (HSTS), X-Frame Options, etc. I know CORS has caused most developers pain at some point in their lives, but it's worth it when it comes to security.

  9. Regularly update your open-source packages, especially when security updates are released. This will ensure you can prevent supply chain attacks.

  10. Always use timeouts for every long-running task/functionality. This will prevent attacks similar to Slow DOS attacks. If hackers find out that some functionality in your system results in a very long execution and uses a lot of memory and computing, they can use that to overload your system and consequently take it down.

  11. Don't send sensitive information in the logs/error logs - this will be covered again later in the Data privacy section, but just consider this a thumb rule. Any information sent in logs can be intercepted or acquired by users through different means and can be used to determine the attack vector in your system.

  12. Use your observability setup to log and alert about weird behaviour in inputs/outputs of the system. Anything out of the ordinary can be sent to the error logs and then developers can evaluate that not as a bug but as potential security attack.

  13. Do penetration testing and other vulnerability testing, especially if you are building mission critical systems. Hire external agencies if your team doesn't have the necessary bandwidth or skill set to do this.

  14. Think like a hacker - Last but not least, it's very important for developers to think like hackers when they are writing their programs. People in infrastructure roles tend to do this by default because of their training, but in my experience, developers often don't bother with it too much while writing code. Developers need to stay up-to-date on the latest hacks and hacking news, in general, to ensure they know what to take care of while writing code.
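
To tie a few of these points together, here is a minimal sketch assuming an Express-style Node.js backend: authorization runs before anything else (point 2), the page size is fixed on the server (point 1), and long-running work is capped with a hard timeout (point 10). The route name and helper functions are hypothetical placeholders, not a drop-in implementation.

```typescript
// Sketch only: auth-first middleware, server-controlled page size, hard timeout.
import express from "express";

const app = express();
const PAGE_SIZE = 20; // fixed on the backend, never taken from the client

// Authentication/authorization runs before any business logic.
app.use((req, res, next) => {
  const token = req.header("authorization");
  if (!token || !isValidToken(token)) {
    return res.status(401).json({ error: "unauthorized" });
  }
  next();
});

// The client sends only an opaque cursor; the limit is server-controlled.
app.get("/v1/items", async (req, res) => {
  const cursor = typeof req.query.cursor === "string" ? req.query.cursor : undefined;
  try {
    // A hard timeout so a slow dependency cannot be abused to exhaust resources.
    const items = await withTimeout(fetchItems(cursor, PAGE_SIZE), 2_000);
    res.json({ items });
  } catch {
    res.status(504).json({ error: "timeout" });
  }
});

function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) => setTimeout(() => reject(new Error("timeout")), ms)),
  ]);
}

// Stand-ins for illustration only.
const isValidToken = (token: string): boolean => token.startsWith("Bearer ");
async function fetchItems(cursor: string | undefined, limit: number): Promise<unknown[]> {
  return []; // placeholder for a real, paginated database query
}

app.listen(3000);
```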

Infrastructure layer security

Infrastructure is usually the first to take the hit in a coordinated attack. It is also low-hanging fruit for bad actors targeting a mid-scale company or a startup, because such companies tend to spend less time and fewer resources on infrastructure security. You can follow these basic pointers to ensure that your infrastructure is secure -

  1. Strong network security - Check your VPC setup for potential leaks like unnecessary open ports or any outgoing traffic that is allowed without authorization. In fact, revisiting the incoming and outgoing traffic settings at regular intervals helps a lot. When you are talking to third-party services, at least ensure that you are using IP-based security. People often ignore IP-based security if some kind of authorization file/key is also available. But please understand that even if the attacker doesn't get access to your data without that file, they can still abuse your endpoints by trying to establish several open connections and keeping them busy. If given a choice between IP-based security and VPC peering, I always try to go for VPC peering. Even though the initial setup is tedious, it's worth it for the speed, scalability and security benefits in the future.

  2. Protect your keys, secrets and authorization files like your life depends on it - Implement the principle of least privilege as much as possible. Keep revisiting the permissions of each key/secret/authorization file and user at regular intervals. Your infrastructure and application setup should only consume keys via environment variables or from mount points, in order to avoid accidentally exposing them to people who don't have the necessary authorization.

  3. Keep rotating each user's and service's credentials, keys, secrets and authorization files at regular intervals. For most of these recurring activities, I just put a recurring event in my calendar. It serves both as a reminder and as a blocked time period for carrying out the activity.

  4. If your organization/systems are susceptible to DDoS attacks, you can either set up IP-based rate limiting along with alerts for unusual behaviour, or go for a paid DDoS protection service, depending on the severity of the problem and your team's bandwidth and skill set (a minimal rate-limiting sketch follows this list).

  5. Endpoint protection - Ensure that your API endpoints are not easily accessible or reversible. If you have certain endpoints that expose some privileged data, consider creating them as private endpoints available within your organization's private network.

  6. Use your observability setup as much as possible for infra security - set alerts for 3X jumps in usual metrics (like requests per second, queries per second, operations per second, bandwidth consumption, etc, depending on the resource). Unless these are planned jumps due to some promotional (marketing) activities, they are usually an indicator of an attack. At the infra level, also do as much data validation and sanitization as possible and log/alert any suspicious values received or processed. Attackers often look for edge/corner scenarios of your infra setup to find the attack vector. If your setup is alerting for such a scenario, that will be the first major step in mitigating that attack and even pinpointing the source of the problem.
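
For the rate-limiting point above, here is a deliberately simple in-process sketch. It only illustrates the idea; a production setup would more likely use a shared store (such as Redis) or a managed WAF/DDoS protection service, and the window and limit values here are arbitrary assumptions.

```typescript
// Sketch of per-IP rate limiting kept in process memory, for illustration only.
import express from "express";

const WINDOW_MS = 60_000;  // 1-minute window (assumed)
const MAX_REQUESTS = 120;  // per IP per window (assumed)

const hits = new Map<string, { count: number; windowStart: number }>();

const app = express();
app.use((req, res, next) => {
  const ip = req.ip ?? "unknown";
  const now = Date.now();
  const entry = hits.get(ip);

  if (!entry || now - entry.windowStart > WINDOW_MS) {
    hits.set(ip, { count: 1, windowStart: now });
    return next();
  }

  entry.count += 1;
  if (entry.count > MAX_REQUESTS) {
    // This is also the place to emit an alert for unusual behaviour.
    return res.status(429).json({ error: "too many requests" });
  }
  next();
});
```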

Data privacy

Have you ever hashed or encrypted your user data before sending it to third-party services such as Auth0 (and other identity management tools), Google Analytics (and other analytics tools like Mixpanel, CleverTap, etc.), logging services (Sentry, etc.), payment gateways, media providers and the hundreds of other third-party services you use while building your platform? If not, then irrespective of how secure you make your own systems, you are still open to attack, because your data security is now in the hands of these companies, and the larger these companies are, the more susceptible they are to attack. The breach of Okta in September 2023 is a classic example. Attackers were able to access data from other companies like Cloudflare, 1Password and BeyondTrust, which use Okta's system. And to make matters worse, it was weeks before Okta could assess the magnitude of the data accessed by the attackers.

  1. Ensure that no sensitive user data is released to a third party. If this data needs to be sent, always hash or encrypt it (see the sketch after this list).

  2. Ensure that the communication between your system and the third-party systems is secure (and encrypted if possible)

  3. Even user behaviour data can be triaged to create an attack vector or to understand your system's internal workings. Be super careful about what you send.
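
A small sketch of the first point, assuming a Node.js backend: pseudonymize identifiers with a salted one-way hash before they are handed to any third-party analytics or logging vendor. The salt environment variable and the usage shown are hypothetical.

```typescript
// Hash (or encrypt) identifiers before they leave your system, so the third
// party never sees the raw email/phone number.
import { createHash } from "node:crypto";

function pseudonymize(value: string, salt: string): string {
  // One-way, salted hash: the vendor gets a stable ID it can aggregate on,
  // but cannot reverse it or join it against other leaked datasets.
  return createHash("sha256").update(salt + value).digest("hex");
}

const APP_SALT = process.env.ANALYTICS_SALT ?? ""; // keep the salt out of source control

// Hypothetical usage: send only the pseudonymized ID to the analytics vendor.
const userIdForAnalytics = pseudonymize("user@example.com", APP_SALT);
console.log(userIdForAnalytics);
```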

Physical security and human factors

I was once in the office of a fintech company where anyone could just walk in without any authorization, plug in a USB drive and copy data from any of the systems. Nobody would have even noticed the attacker if this were done at the right time (for example, during lunch hours). This is the case with a lot of companies that handle financial data: they don't invest heavily in physical security and human factors. The leak can be as simple as a Google Sheet that contains sensitive data and has poor sharing permissions, leaving it open to the public for viewing.

Social engineering refers to a trickery technique used to manipulate people into divulging confidential information or performing actions that benefit the attacker. It's essentially a con job that exploits human psychology and vulnerabilities.

If you read this blog on the state of social engineering, you will learn that social engineering has accounted for more than 90% of cyber attacks in recent years.

You can take the following steps yourself and ensure that people in all departments of your organization are aware of them -

  1. Social engineering often relies on a sense of urgency or panic. Take a moment to breathe and assess the situation before acting.

  2. Be sceptical of unsolicited emails, calls, messages, or people approaching you in person. Don't assume they are who they say they are.

  3. Too good to be true? It probably is: If an offer seems incredibly enticing or a situation feels suspicious, it's likely a ploy to manipulate you.

  4. Don't click on links or attachments in emails from unknown senders. Verify email addresses carefully, even those seemingly from familiar sources. Phishing emails often mimic legitimate senders.

  5. Use strong passwords for all your accounts, and ideally, enable two-factor authentication (2FA) for added security. This requires an extra verification step beyond just your password.

  6. Keep your antivirus and anti-malware software up-to-date. These can help detect and block malicious software that might be downloaded through social engineering tactics.

  7. When entering personal information online, ensure the website is legitimate and secure. Look for the https:// prefix in the address bar and a lock symbol, indicating a secure connection.

  8. If you receive a call from someone claiming to be from a company (bank, tech support, etc.), don't give out personal information or grant remote access to your computer. Call the company directly using a verified phone number to confirm the call's legitimacy.

  9. Careful with Physical Documents: Be cautious about sharing sensitive documents or information in person. Don't leave them unattended or readily accessible to strangers.

  10. Social Media Savvy: Be mindful of what information you share on social media platforms. Attackers can use this information to personalize social engineering attempts.

  11. Stay Informed: Keep yourself updated on the latest social engineering tactics. Learning about common tricks can help you identify and avoid them. You can subscribe to the hackernews newsletter. It should be enough to update yourself on recent attacks.

  12. Double-check your organization's sharing settings and permissions in all the tools used. Enforce the correct permissions at a global level (for example, nobody should be allowed to share a document publicly, irrespective of the circumstances).

Given that social engineering attacks are becoming more sophisticated by the day, it is virtually impossible to list all the possible precautions. But the above steps should serve as general rules of thumb to follow in order to avoid getting hacked.

Backup and disaster recovery

I mean to be sarcastic when I say this - if I have to sell you on why automated backups, point-in-time recovery and general disaster recovery mechanisms are important, then you may not be the right audience for this blog post.

Let's be honest. Irrespective of how much you invest in your security, there is always a talented hacker (or a group of hackers) who will defeat every measure you have put in place. Your backup and disaster recovery setup should probably be the first item on the security checklist.

  1. Automate your backups to run at regular intervals so that you can sleep peacefully at night.

  2. Establish a data retention policy for each backup (and backup type) that factors in both the short-term and long-term goals of the organization. Remember that archival storage solutions like S3 Glacier are available that allow you to retain a lot of data at a fraction of the storage cost.

  3. If you are an enterprise, consider backing up data in different availability zones or even different cloud providers in order to be safe. Also, consider disaster recovery drills at a regular interval.

  4. Just don't assume that your backup automations are working correctly. Mark your calendars to check them at a reasonably regular interval (I check once a month).

  5. Document your backup strategy, but keep the documentation restricted to a limited set of users. Otherwise, hackers may social-engineer the backup strategy out of your team members and leverage that information to gain the upper hand in an attack. A lot of ransomware attacks do exactly that.

OWASP security cheat sheet

While doing research for this blog post, I came across this interesting, well-organized cheat sheet by OWASP. Do give it a thorough read. It covers a lot of topics in sufficient detail.

Ending note

Cyber Security is a big business now. There are several billion-dollar companies that are working in the domain of cybersecurity, and I am sure there will be many more down the line. If you are not thinking about security today, then there is a very good chance that by the time you start thinking about it, it will be too late. If you are not actively doing security, the least you can do is to add it as a technical debt. Acknowledging it as a to-do will be enough because you will hear about new attacks and data breaches every month, and the panicky person inside you will force you to revisit that list.

I tried my best to list as much basic stuff as possible. This list will never be complete, but it should serve as a good starting point. And sometimes, that's all you need—a good starting point.

I hope the above system helps you as much as it has helped me. If you have any questions, doubts or suggestions, please contact me on Twitter, Linkedin or Instagram. Do share this blog post with your friends and colleagues.

Process and serve millions of images and videos efficiently - A media management system for social networks

· 19 min read
Aditya Kumar

Media Management

Media files have become indispensable for all web/mobile/TV applications in today's digital-first world. Online businesses such as e-commerce, EduTech, news and media, social media, dating, online food and grocery apps, and hotel booking are immersive, captivating and addictive because of media content such as images, videos, audio, live streams, etc.

Developers often approach media management as part of a specific feature or service. Code for handling media files - upload, fetch, edit, delete, cleanup, etc. - is usually tightly coupled with the specific functionality, even though the underlying storage and delivery mechanisms for these media files are essentially the same (cloud file storage solutions like Amazon S3, plus CDN solutions for delivery). But what if I told you there is a better solution for your media management needs?

A solution that you can build once and use anywhere and anytime.

I have been building SaaS applications and social media platforms for a long time. The engineering complexity of handling media in these categories of platforms is a fascinating problem statement for two reasons -

  1. There is no limit on the functionality you need to build. These platforms typically have a wide variety of functionalities for serving the needs of different types of users, and many involve handling and serving media files.

  2. There is a ton of user-generated content, and it's tough to reliably estimate how much media content a single user will consume or generate.

Let me take the example of basic functionalities in Instagram to give you a high-level idea of how media files are used across various functionalities -

  1. Your profile contains your profile picture.

  2. A single post can contain multiple photos or even videos.

  3. A story post can contain audio files and images.

  4. Reels contain video files.

  5. DMs can have images/videos/audio, and much more.

Instagram is a social network built around images and videos, so you may consider it an exception. But think about any typical online first business like E-commerce, Food delivery, ride-hailing, financial transactions or anything else. Most of them have profiles that contain media, products/offerings that contain media, etc.

Apart from this, media files have their own specialized set of requirements, such as -

  1. Compressing the files - Each media file type (video, image, audio) has a different compression need (and usually a different methodology).

  2. Creating different variants of the image files - Think of thumbnails and different sizes of the same post for optimized viewing on different devices and networks.

  3. Breaking down large video files into chunks for video streaming. (See HLS)

  4. Keeping track of metadata such as file sizes, extensions, etc.

  5. Keeping track of the file storage system used and the CDN provider used - I know most people use only one, but keeping track in case you need to change the vendor for any reason in the long run.

I have used a generic media management system for the last ten years, and it has seen huge success in availability, reliability, manageability and extensibility across different use cases. It's also loved by developers on my teams because, once programmed and stabilized, it's effortless to use in new functionality or to change old functionality.

In this post, I will give you a high-level idea of how to build a similar system for your organization. But first, let's look at the product requirements from an engineering perspective.

I will take Instagram as an example (wherever required) because it's highly relatable for most people. From an engineering perspective, we have to solve the following problems.

  1. Maintain a mapping of media storage structure in our database - This system is supposed to act as a generic media management system for all the media needs of the platform in the present and future. It should be able to track the individual media files associated with different entities (posts, stories, profile pictures) and store this association in a database.

  2. Store the information associated with the media files - The database entries should also track the file types, file extension, file size, creation and update timestamps, CDN endpoints, relative URL, user identifier, entity identifiers, soft and hard delete status, etc.

  3. Ability to map multiple media providers - we must keep a unique key to map the media provider (Amazon S3, Google Cloud Storage, Cloudflare, etc.) in the database. This will help us use multiple vendors, and we can switch based on the cost consideration whenever and wherever (for a particular feature) we want.

  4. Ability to generate signed URLs based on media providers for media access- We will write separate utility functions for each media provider to perform various functions like redirects, generating signed URLs, etc.

  5. Maintain a cache of signed URLs - Each signed URL generated by the system will have X hours (or minutes) of validity. Caching them speeds up media access, so we don't need to generate a new signed URL for every file access request.

  6. Generic file upload interface - The system should have a generic API/GraphQL (replace with whatever you are using) endpoint for file upload that is unaware of the folder structure and other storage requirements for each entity type and its variants. In our file storage service, we will keep a neatly organised folder structure for storing and mapping media files.

  7. Cleanup of signed URLs that were not used - We need to track when a signed URL was generated, but the media file was not uploaded using that URL. This will help us perform cleanups on our media database.

  8. Cleanup of files whose entities have been deleted - The system should be able to delete the files whose associated entities (post, comment, message, status, etc) have been deleted.

A few important side notes before you move forward -

  1. File extensions and file types should be standardized in your system. Don't keep the original file extension uploaded by the user. I recommend converting the media file to a standard extension according to the media type. This will give you a huge boost in client-side caching and rendering speed and help you in compression and storage optimization. Choose the format (extension) that supports your use case based on factors like what kind of network your users typically have (wifi/4G/5G), whether it's a mobile app or web app, what the access pattern of the media is for a typical user, and how much media you have to render on a single screen or scroll.

  2. If you don't know what pre-signed URLs are and what advantages they offer, you should read up on them. In short - they are a secure way to upload files directly from your client to storage solutions like Amazon S3. The upload is very fast because it doesn't need to go through your server-side application. Similar URLs are also used in file delivery to keep your media files secure. (A minimal sketch of generating one follows this list.)

  3. If you don't know what Content delivery networks are, please read up on them.

  4. Media processing is a CPU-intensive activity and should be performed asynchronously. (I have explained this later)

  5. While maintaining the cache of Signed URLs, keep your end user behaviour and feature behaviour in mind. For example, if you are posting a story on Instagram, then you cannot change the media associated with the story, and also the stories are live only for 24 hours. So, your cache settings should be optimized for this.
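
To make the pre-signed URL idea concrete, here is a minimal sketch in Go (the language our services are written in) using the AWS SDK for Go v1. The bucket name, region, key layout and 15-minute expiry are illustrative assumptions, not values from a production system.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

// GenerateUploadURL returns a pre-signed PUT URL that the client can use to
// upload a file directly to S3, bypassing our backend entirely.
func GenerateUploadURL(svc *s3.S3, entityType, entityID, variant, fileName string) (string, error) {
	// Hypothetical key layout: <entityType>/<entityId>/<variant>/<fileName>
	key := fmt.Sprintf("%s/%s/%s/%s", entityType, entityID, variant, fileName)

	req, _ := svc.PutObjectRequest(&s3.PutObjectInput{
		Bucket: aws.String("my-media-bucket"), // placeholder bucket name
		Key:    aws.String(key),
	})

	// Keep the validity window short; the client must finish the upload within it.
	return req.Presign(15 * time.Minute)
}

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("ap-south-1")}))
	svc := s3.New(sess)

	url, err := GenerateUploadURL(svc, "postVideo", "post_123", "original", "clip.mp4")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("upload here:", url)
}
```

The same idea works for signed GET URLs on the delivery side; only the request type and expiry change.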

Solution - High-level working of the media management system

The following diagram explains the high-level working of the system and all the components used.

Media Management Architecture

Following is how the system works -

  1. There are two services in the system. One, called the "Media management service", is responsible for the CRUD operations related to media files; the other, called the "Media processing service", is responsible for all the processing and cleanup activities.

  2. When a user wants to upload a file for any entity (post, status, message, etc.), the client application (web/mobile) fires an HTTP request to generate a secure pre-signed URL for upload. The media management service performs all the necessary validation and authorization checks based on the data (user's unique ID, entity ID, rate limiting, etc.) and generates a pre-signed URL for upload. The logic of this endpoint is aware of the base folder structure for each entity through a configuration file. The client passes the required input parameters (entityId, entityType, variant name, etc.) and, in return, gets a signed URL, which it can use to upload the file to the storage provider directly.

  3. The client uploads the file directly to the cloud file storage solution (Amazon S3, Google Cloud Storage, etc.) using the secure pre-signed URL. Since this upload does not go through our backend services, the client has to notify the media management service once the upload is finished. It passes the final file information received from the file storage solution in another HTTP request to the media management service.

  4. Once the file upload is confirmed, the media management service triggers a pub/sub event to notify the Media processing service. Upon receiving this event, the processing service starts processing according to the media type and the business logic for handling this file. It concurrently starts all the processing, like compression, thumbnail generation, image variant generation, HLS generation, etc. Please note that only the media processing service is aware of how each file should be treated, what processing has to be done, and where and how to store the resultant files from the processing activity. Custom functions are written for this based on your organisation's product and business requirements. Also, media processing is usually both CPU and memory-intensive, so the service should be deployed and monitored with these requirements in mind.

  5. When another user requests this file while consuming the content (post, message, story, etc.), the client has only one piece of information - the unique identifier of the media object (mediaId). I will cover why I have kept it this way in the low-level design (covered later in this post). So, all the entities (post, comment, story, messages) store only the mediaId(s) in their schema, and they are unaware of any other details of the media file. This decoupling makes the system generic, and most of the magic lies in the system's low-level design, which we will cover later.

  6. So, the client makes an HTTP GET request to the Media management service using the mediaId (the unique identifier of the media) and passes any other relevant information as query parameters. For example, in the case of Instagram, your profile picture can have multiple variants, like small, medium, or large, depending on where it is displayed in the app. On the profile page, it's a medium-sized image, but the same image in the comments section or in messages is a small variant. In case you don't know this, it's a prevalent practice to keep the end-user experience excellent while loading media files, and it also saves tons of bandwidth cost on the company's end. So, in this case, a query would look something like GET <base url>/fetchMedia?variant=small. You can pass multiple query parameters as well.

  7. When this request to fetch a media file reaches the media management service, it performs the necessary validation, authorization and other security checks, and instead of responding with a JSON body, it directly does a 301 redirect to the URL of the media file stored in the CDN. Now, please be aware that depending on the media processing speed of your system (which is essentially a factor of how well the code is written, the infra provisioned, and the concurrency handled by that service), there is a slight chance that the variant the client requested is not yet available at the time of the request. In that case, the media management service should have logic in place to redirect to the original file. As you can imagine, numerous conditional statements have to be written in the logic of this endpoint. These conditions depend on the type of media file requested and your corresponding CDN setup. So, it will first perform a lookup in a NoSQL DB to fetch the detailed record of the media using the mediaId and then execute the corresponding checks before the final redirect. But that's also an easy check because of the low-level design of the system (discussed later). A sketch of this endpoint follows this list.

  8. Another vital point to note is the synchronization of files between your media storage solution and your CDN provider. In most cases, it's supported automatically (for example, AWS S3 and Cloudfront), but if you are using a provider that doesn't support your storage solution, you must write the code for it in the Media processing service. You can leverage the same pub-sub event and add this as another activity.

  9. To ensure that the fetchMedia endpoint is extremely fast - you must cache the Pre-signed URL from the CDN provider. Also, be cautious about the expiration you set for your pre-signed URL and the cache. It should be decided based on factors such as your security needs, end-user access patterns, availability of correct file variants, etc. Also, please store the signed URL in the cache instead of the entire media object. This will ensure that your redirects are fast and you are not wasting storage. So, it's a simple key-value storage in the cache database.

  10. Your cleanup tasks should happen asynchronously in the media processing service, such as deleting lingering file entries that were not uploaded correctly and cleaning up media entries for deleted entities. You can use any cron-like scheduler for the cleanup tasks, and then it will just call a function that finds these entries in the table using database queries and performs necessary updates.
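
Here is a minimal sketch of the fetchMedia endpoint described in steps 6 and 7, written as a plain Go HTTP handler. The MediaStore interface is a hypothetical stand-in for the signed-URL cache plus database lookup; the post describes a 301, while this sketch uses a temporary redirect because signed URLs expire - pick whichever matches your caching strategy.

```go
package media

import "net/http"

// MediaStore is a hypothetical lookup interface; in this design it would be
// backed by the signed-URL cache (Redis) with a NoSQL fallback.
type MediaStore interface {
	// SignedCDNURL returns a signed CDN URL for a mediaId + variant
	// combination, or ok=false if that variant does not exist yet
	// (e.g. processing hasn't finished).
	SignedCDNURL(mediaID, variant string) (url string, ok bool)
}

// FetchMediaHandler sketches steps 6 and 7: validate, look up the requested
// variant, fall back to the original file if the variant isn't ready, and
// redirect the client straight to the CDN instead of returning JSON.
func FetchMediaHandler(store MediaStore) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		mediaID := r.URL.Query().Get("mediaId")
		variant := r.URL.Query().Get("variant")
		if mediaID == "" {
			http.Error(w, "mediaId is required", http.StatusBadRequest)
			return
		}
		if variant == "" {
			variant = "original"
		}

		// Authorization and rate-limit checks would run here.

		url, ok := store.SignedCDNURL(mediaID, variant)
		if !ok {
			// Requested variant not processed yet: fall back to the original.
			url, ok = store.SignedCDNURL(mediaID, "original")
			if !ok {
				http.NotFound(w, r)
				return
			}
		}
		// Signed URLs expire, so a temporary redirect is used in this sketch.
		http.Redirect(w, r, url, http.StatusFound)
	}
}
```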

Now that you have an idea of how this system works, let's discuss the low-level design.

Low-level design and the reasoning for it

This system requires you to have only one table/collection called MediaInformation.

// Schema
// Table Name - MediaInformation

mediaId - string - a unique identifier for this entry - 20 chars
userId - string - unique identifier of the user who uploaded the media file - 20 chars
fileName - string - 20-25 characters
createdOn - number - epoch timestamp in milliseconds
lastModifiedOn - number - epoch timestamp in milliseconds
entityId - string - up to 50 characters - unique identifier of the entity this media is related to - examples: post/message/story/comment IDs
entityType - string - up to 50 characters - type of the entity this media is related to - examples: post/message/story/comment
variantName - string - which variant this entry is - examples: original/compressed/thumbnailSmall/thumbnailLarge
mediaIdVariantIndex - string - a field that combines the mediaId and variant into a single string for indexing purposes
storageRelativePath - string - up to 100 characters
storageProvider - string - example values are AmazonS3 and CloudFlare
cdnRelativePath - string - up to 100 characters
cdnProvider - string - example values are CloudFront and CloudFlare
mediaType - string - example values are image/audio/video
mediaSize - number - in KB
fileExtension - string - example values are jpg/jpeg/png/mp4/mp3/wav
mediaStatus - string - example values are signedUrlGenerated, live, unpublished

Most fields in the above schema are self-explanatory. So, let's discuss the ones that aren't -

  1. entityId and entityType - I mentioned this several times in this blog post. The point of the system is to be generic. So, every type of media you store can be called an entity, and it will have an entityType and entityId (unique identifier). In the Media management service and Media processing service, we have to store all the possible values of entityTypes in the form of a configuration file. All the processing functionality will also be tied to what entityType we are processing. For example, in the case of Instagram, you can have entityTypes such as profilePicture, postPicture, postVideo, postAudio, storyPicture, storyVideo, messagePicture, messageVideo etc., and the entityId will be the unique identifier from the corresponding tables of these entities. For example, for a video in a post, the entityType will be postVideo and entityId will be the postId (unique ID in a table called post).

  2. Variant name - This field stores whether this media entry is the original file uploaded by the user or a processed version. Again, both the media services (management and processing) have to keep track of all the possible variant values for a particular entity type. This depends entirely on your use case and the different screens where this media is rendered. For example, OTT platforms must generate several variants of the same media file for a better rendering experience on different devices.

  3. mediaIdVariantIndex - This is a hack I have used in several systems. Understand your end-user access pattern: for each media file, in 99% of cases the client will pass the mediaId and the variant name. So, by introducing an additional string field that is simply a concatenation of the mediaId and variantName values, you can create a global index on your table to speed up the query. I know many modern databases offer compound indexes on two fields built into the database engine, but in a distributed database environment, a simple index will typically give you better performance (for example, less than five milliseconds of read latency). This index also lets you check whether a specific variant of a media file is available using a single database lookup (mentioned in point 7 of how this system works in the previous section). See the struct sketch after this list.

  4. Relative paths for storage and CDN provider - If you are still storing absolute paths, please don't. There is no upside and several downsides!

  5. Name of the media provider and CDN provider - These fields ensure the system is generic. Your organization can have file storage and delivery distributed across the globe, and you may need different providers in different regions for several reasons like cost, operational ease, availability of edge locations, etc. When the fetchMedia endpoint is called, it will take the Base URL of the file based on the media provider and the CDN provider. These Base URLs should be stored in a simple configuration file for speed.

  6. File information fields - I have only mentioned two fields for file metadata. But in real-life production environments, you must store much more information according to the media type. For example, in the case of video/audio, you may need to store the FPS (frames per second); in the case of images, you should store the resolution. Extend the same schema to include all these additional fields according to your use case. Also, if any of these fields play a pivotal role in end-user access patterns, you know what to do -> create indexes to support faster reads.
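
To tie the schema together, here is a rough Go struct mirroring the MediaInformation table, along with the composite-key builder behind mediaIdVariantIndex. The dynamodbav tags and the "#" separator are assumptions made here for illustration; use whatever your database driver and key conventions require.

```go
package media

import "fmt"

// MediaInformation mirrors the table above. The dynamodbav tags are only
// relevant if you use DynamoDB; swap them for whatever your driver expects.
type MediaInformation struct {
	MediaID             string `dynamodbav:"mediaId"`
	UserID              string `dynamodbav:"userId"`
	FileName            string `dynamodbav:"fileName"`
	CreatedOn           int64  `dynamodbav:"createdOn"`      // epoch millis
	LastModifiedOn      int64  `dynamodbav:"lastModifiedOn"` // epoch millis
	EntityID            string `dynamodbav:"entityId"`
	EntityType          string `dynamodbav:"entityType"`
	VariantName         string `dynamodbav:"variantName"`
	MediaIDVariantIndex string `dynamodbav:"mediaIdVariantIndex"`
	StorageRelativePath string `dynamodbav:"storageRelativePath"`
	StorageProvider     string `dynamodbav:"storageProvider"`
	CDNRelativePath     string `dynamodbav:"cdnRelativePath"`
	CDNProvider         string `dynamodbav:"cdnProvider"`
	MediaType           string `dynamodbav:"mediaType"`
	MediaSizeKB         int64  `dynamodbav:"mediaSize"`
	FileExtension       string `dynamodbav:"fileExtension"`
	MediaStatus         string `dynamodbav:"mediaStatus"`
}

// VariantIndexKey builds the mediaIdVariantIndex value: a plain concatenation
// of mediaId and variantName, so a single equality lookup on a global index
// answers "does this variant exist?".
func VariantIndexKey(mediaID, variantName string) string {
	return fmt.Sprintf("%s#%s", mediaID, variantName)
}
```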

A few critical pointers -

  1. This system requires the application level to hold a lot of configuration. I know some engineering leaders are fundamentally against this practice and would rather store the configuration in a database to avoid accidentally breaking the system. In principle, I don't agree with that approach, as it introduces additional overhead (an extra database lookup), which slows down the system. But if you are one of these leaders, you can use either of the following suggestions as a middle ground -

    1. Keep the configurations at the infra level and let developers access them through environment variables to avoid accidental manipulation. Build configuration-first applications. (A small example of such a configuration follows this list.)

    2. Set up unit tests and rules in your CI/CD pipelines to avoid bugs arising from the changes in such configuration.

  2. I have given an example of just one indexing strategy that works in the case of generic systems. But your indexing strategy must match your end user access patterns if you are building media-heavy applications such as social networks, OTT platforms, etc. You may also consider using different tables for mission-critical file types. For example, an OTT platform should store the video data in dedicated tables optimized for delivery.
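
As an example of the application-level configuration discussed above, the media services can hold a simple map from entityType to its base folder and the variants the processing service must generate. This is a hypothetical sketch; the entity types, folder names and processing hints are placeholders.

```go
package media

// VariantConfig describes one processed output for an entity type.
type VariantConfig struct {
	Name     string // e.g. "thumbnailSmall"
	MaxWidth int    // illustrative processing hint
}

// EntityConfig is what the media services read from a configuration file
// (or environment) instead of a database: the base folder in the storage
// bucket and the variants the processing service must generate.
type EntityConfig struct {
	BaseFolder string
	Variants   []VariantConfig
}

// entityConfigs is a hypothetical example; real values depend entirely on
// your product's entity types and rendering needs.
var entityConfigs = map[string]EntityConfig{
	"profilePicture": {
		BaseFolder: "profiles",
		Variants: []VariantConfig{
			{Name: "thumbnailSmall", MaxWidth: 96},
			{Name: "thumbnailLarge", MaxWidth: 320},
		},
	},
	"postVideo": {
		BaseFolder: "posts/videos",
		Variants: []VariantConfig{
			{Name: "compressed", MaxWidth: 720},
			{Name: "hls", MaxWidth: 1080},
		},
	},
}
```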

Tools and technologies used

  1. In our case, the services have been written in the Go programming language. But you can choose any language depending on your scale, developer comfort and cost considerations.

  2. We use Google Pub/Sub for our asynchronous communication because, at our current scale (a few lakh total users, thousands of daily active users), it's highly cost-effective. You can use any alternative like Kafka, ZeroMQ, NATS, AWS SQS, etc.

  3. Redis - Speed, simplicity and robust data structures. I don't think I need to discuss why Redis in 2024.

  4. DynamoDB - Again, it is highly scalable and easy to use, and we run it in serverless mode where, despite hundreds of thousands of queries per minute, our total bill is relatively low. It also offers powerful indexing capabilities and single-digit millisecond latency in reads and writes. I would highly recommend using it for this use case. But you can always use any other NoSQL or wide-column database with similar indexing capabilities and speed.

  5. Amazon S3 is used for file storage, and Amazon CloudFront is used for content delivery. Again, you can use any other solution. That's kind of the point of this blog post.

Ending note

So, we solved all the problems mentioned in the problem statement section using simple architecture and low-level design magic. I have intentionally not covered a lot of things like -

  1. How do you process different types of media files?

  2. How do you decide what information to store about each type of media?

Those are vast topics in themselves and deserve their own blog posts; I am sure I will be covering them soon. This blog post aims to give you an idea of how to do media serving, management and processing effectively at scale. But if you are going to use this media management system in your organization and want answers to those questions, don't hesitate to get in touch with me. I hope the above system helps you as much as it has helped me. If you have any questions, doubts or suggestions, please contact me on Twitter, Linkedin or Instagram. Do share this article with your friends and colleagues.

Building a Basic Recommendation Engine: No Machine Learning Knowledge Required!

· 13 min read
Aditya Kumar

Recommendation engine

Recommendation systems have become an integral and indispensable part of our lives. These intelligent algorithms are pivotal in shaping our online experiences, influencing the content we consume, the products we buy, and the services we explore. Whether we are streaming content on platforms like Netflix, discovering new music on Spotify, or shopping online, recommendation systems are quietly working behind the scenes to personalize and enhance our interactions. The unique element of these recommendation systems is their ability to understand and predict our preferences based on historical behaviour and user patterns. By analyzing our past choices, these systems curate tailored suggestions, saving us time and effort while introducing us to content/products that align with our interests. This enhances user satisfaction and fosters discovery, introducing us to new and relevant offerings that we might not have encountered otherwise.

At a high level, developers understand that these algorithms are powered by machine learning and deep learning systems (interchangeably called neural networks). But what if I told you there is a way to build a recommendation engine without going through the pain of deploying your own neural net or machine learning model?

This question is specifically relevant in the context of early and mid-stage startups because they don't have tons of structured data to train their models. And as we already know, most machine learning models will not give accurate predictions without proper training data.

I recently built and deployed a basic recommendation engine for a voice-first social network, which led to a 40% jump in our key metrics. At the time of writing this blog, the system is generating more than 30 million recommendations per month. Even though this recommendation system was built for a social network, you can apply the basic architecture to any use case, such as product recommendations, music recommendations, content recommendations on text and video platforms, or anything else. Let me start by describing the problem statement.

I had an extensive product requirement document and a subsequent engineering requirements document because we were building the recommendation system for a product already used by thousands of users daily. But to keep this blog short and on point, I will list only the high-level requirements and then discuss the solution. If you are building a recommendation system for your product (simple or neural-net-based) and are stuck somewhere, please feel free to contact me on Twitter or Linkedin, and I will be more than happy to answer your questions.

At a high level, we had the following requirements from an engineering perspective -

  1. The system should be able to capture a user's interests in the form of keywords. The system should also be able to classify the level of interest a user has with specific keywords.

  2. The system should be able to capture a user's interest in other users. It should be able to classify the level of interest a user has in content created by another user.

  3. The system should be able to generate high-quality recommendations based on a user's interests.

  4. The system should be able to ensure that recommendations already viewed/rejected by the user don't reappear for X number of days.

  5. The system should have logic to ensure that the posts from the same creators aren't grouped on the same page. The system should try its best to ensure that if a user consumes ten posts (our page size), all of those should be from different creators.

  6. The system should be fast. Less than 150 milliseconds of P99 latency.

  7. All the other non-functional requirements, such as high availability, scalability, security, reliability, maintainability, etc, should be fulfilled.

Again, this is a highly oversimplified list of problem statements. In reality, the documents were 3000+ words long as they also covered a lot of edge cases and corner cases that can arise while integrating this recommendation engine into our existing systems. Let's move on to the solution.

Solution - High-level working of the recommendation engine

I will discuss the solutions to the problem one by one and then will describe the overall working of the entire system.

Our first problem is capturing the user's interests and defining their interest level with a specific interest.

For this, we created something called a social graph. To put it simply, a social graph stores the relationships and connections between different entities in a social network. These entities can be different users, or a user's relationship with a specific interest. Social graphs are a powerful way to understand and structure the relationships within a particular system. For the sake of brevity, I will not explain social graphs in detail, but I recommend you google the topic and learn more about it. Following is a simplified version of the social graph I built for our recommendation engine.

Social graph

As you can see from the above image, we are storing a lot of information, such as the number of interactions (likes, comments, shares) and the recency of these interactions (when they happened last), as relationship data between two users as well as between a user and an interest. We are even storing the relationship between two different interest keywords. I used Amazon Neptune, a managed graph database by AWS, to store this social graph. You can use any other graph database, such as Neo4j, JanusGraph, ArangoDB, etc.
These interest keywords are predominantly nouns. There is a system in place that breaks down the contents of a post into these keywords (nouns). It's powered by AWS Comprehend, a natural-language processing (NLP) service that uses machine learning to break text into entities, key phrases, etc. Again, you can use any managed NLP service (several are available) to accomplish the same. You don't need to learn or deploy your own machine-learning models! If you already understand machine learning, you can also check out open-source NLP models on Hugging Face.
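
For reference, here is a minimal sketch of keyword extraction with AWS Comprehend using the AWS SDK for Go v1. The region and the 0.8 confidence threshold are illustrative assumptions; a real pipeline would likely do more filtering (stop words, deduplication, etc.).

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/comprehend"
)

// ExtractKeywords pulls key phrases out of a post's text so they can be
// stored as interest nodes in the social graph.
func ExtractKeywords(svc *comprehend.Comprehend, text string) ([]string, error) {
	out, err := svc.DetectKeyPhrases(&comprehend.DetectKeyPhrasesInput{
		Text:         aws.String(text),
		LanguageCode: aws.String("en"),
	})
	if err != nil {
		return nil, err
	}

	var keywords []string
	for _, p := range out.KeyPhrases {
		// Keep only phrases Comprehend is reasonably confident about.
		if p.Score != nil && *p.Score > 0.8 {
			keywords = append(keywords, aws.StringValue(p.Text))
		}
	}
	return keywords, nil
}

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("ap-south-1")}))
	svc := comprehend.New(sess)

	kws, err := ExtractKeywords(svc, "Watched the cricket world cup final in Mumbai last night!")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(kws)
}
```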

Our second problem is generating high-quality recommendations based on a user's interest.

The following diagram is a simplified high-level representation of how the system works.

Basic steps

While the above looks easy, there is a lot more going on at each step, and those things have to be carefully thought through and programmed to ensure the system performs optimally. Let me explain step by step.

Step 1 - Converting post content into vector embeddings

To generate these recommendations, we first have to convert the contents of a post into something called vector embeddings. With the recent uptick in the branding of LLMs, OpenAI (the makers of ChatGPT) and vector databases, vector embeddings are becoming an everyday term. I will not go into the details of what they are and how they work, but I highly recommend reading more about them. Note that generating viable candidates for a feed also has to account for things like content privacy and moderation (removing profane words, abuse, sexual content, harassment, filtering blocked users, etc.).

For generating the vector embeddings, you can use any prominent embedding model, such as the OpenAI embedding model, Amazon Titan, or any open-source text embedding model, depending on your use case. We went with Amazon Titan because of its friendly pricing, performance and operational ease.

Step 2 - Query the user's interest

Now, this is where things get interesting. You would want to design the queries based on your specific business needs. For example, while querying interests, we give more weight to the recency of engagement than to the number of engagements with a specific keyword or user. We also run multiple parallel queries to find different types of interests of the user - keywords or other users. Since we generate multiple feeds for a single user, we also run some queries promoting a specific topic according to the trend (for example, you will see many Christmas-related posts near Christmas, or earthquake-related posts if an earthquake has happened). Needless to say, such a topic will only come up in the query results if the user has expressed some interest in it during their journey.

So, choose the logic that suits your business use case and the behaviour you want to drive, and run multiple queries to get a big enough list of the user's interests.
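
As an illustration of weighting recency over raw engagement counts, here is one possible scoring function in Go. The exponential decay and the 7-day half-life are assumptions made for this sketch, not the formula we actually use in production.

```go
package reco

import (
	"math"
	"time"
)

// InterestEdge is the relationship data kept on a graph edge between a user
// and an interest (or another user).
type InterestEdge struct {
	Keyword          string
	InteractionCount int
	LastInteraction  time.Time
}

// interestScore is an illustrative weighting: engagement count matters, but
// recency dominates via exponential decay with a hypothetical 7-day half-life.
func interestScore(e InterestEdge, now time.Time) float64 {
	const halfLifeDays = 7.0
	ageDays := now.Sub(e.LastInteraction).Hours() / 24
	recency := math.Exp(-math.Ln2 * ageDays / halfLifeDays)
	return recency * math.Log1p(float64(e.InteractionCount))
}
```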

Step 3 - Do an ANN search based on the interests found

Vector databases are predominantly used for performing a particular type of search called Approximate Nearest Neighbour search (ANN). Again, the way you categorize various interests, and whether you do one big ANN search or multiple parallel searches, should be based entirely on your use case and business requirements. I recommend running multiple cohort-based searches and then ordering the results (we will discuss this later in this blog) for the best end-user experience. What the ANN search does, in this case, is find other posts on the platform that are similar (closer) to the interests of the user.
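
If ANN is new to you, the following brute-force Go sketch shows the exact computation that a vector database approximates: rank every candidate post by cosine similarity to an interest embedding and keep the top k. A vector database does this approximately over millions of rows; the brute-force version is only practical for tiny candidate sets.

```go
package reco

import (
	"math"
	"sort"
)

// Post pairs a post ID with its embedding.
type Post struct {
	ID        string
	Embedding []float64
}

// cosine computes cosine similarity between two embeddings of equal length.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// nearestPosts is the exact, brute-force version of what ANN approximates:
// sort all candidates by similarity to the interest embedding, keep the top k.
func nearestPosts(interest []float64, candidates []Post, k int) []Post {
	sort.Slice(candidates, func(i, j int) bool {
		return cosine(interest, candidates[i].Embedding) > cosine(interest, candidates[j].Embedding)
	})
	if k > len(candidates) {
		k = len(candidates)
	}
	return candidates[:k]
}
```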

Step 4 - Store the results in a cache database with ordering.

We use a cache database because one of the problems we need to solve is speed. We used Redis sorted sets for storing the unique IDs of the posts for a specific user, because the order of posts in a user's feed is critical. Another problem we have to solve is that "the system should have logic to ensure that the posts from the same creators aren't grouped on the same page". To avoid repetition of content from the same creator, we wrote a simple algorithm which ensures that if a specific creator's post is inserted at any position in a particular user's feed (sorted set), we don't insert another post from the same creator for the next ten positions (we have a page size of 10 while serving the feed to the end user, so we kept it static to avoid complexity).
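
A simplified version of that insertion logic, using go-redis and one sorted set per user, might look like the sketch below. The "creatorId:postId" member convention and the fixed window size are assumptions made here for illustration; a production version would handle more edge cases (retries, re-scoring, trimming the set, etc.).

```go
package reco

import (
	"context"
	"fmt"
	"strings"

	"github.com/redis/go-redis/v9"
)

// AddToFeed inserts a post into a user's feed (a Redis sorted set) unless the
// same creator already appears in the latest `window` entries.
func AddToFeed(ctx context.Context, rdb *redis.Client, userID, creatorID, postID string, score float64, window int64) (bool, error) {
	key := "feed:" + userID

	// Inspect the highest-scored `window` entries already in the feed.
	recent, err := rdb.ZRevRange(ctx, key, 0, window-1).Result()
	if err != nil {
		return false, err
	}
	for _, member := range recent {
		// Members are stored as "<creatorId>:<postId>" so the creator can be
		// checked without an extra lookup.
		if strings.SplitN(member, ":", 2)[0] == creatorID {
			return false, nil // same creator too close; caller can retry later with a lower score
		}
	}

	member := fmt.Sprintf("%s:%s", creatorID, postID)
	if err := rdb.ZAdd(ctx, key, redis.Z{Score: score, Member: member}).Err(); err != nil {
		return false, err
	}
	return true, nil
}
```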

For deciding the order of a specific recommendation of the user, we factored in the following things -

  1. The strength of the relationship with a specific interest (or another user) for this user: It's determined by an arithmetic formula that takes various data points from the social graph. All of this is engagement data, like the timestamp of the last like, the number of likes, the last comment, etc. User engagement behaviour is the clearest indicator of their interest in something.

  2. The popularity of the post on the platform: To determine this, we have created an algorithm that takes various factors such as engagement, engagement-to-impression ratios, number of unique users who engaged, etc., to generate an engagement score of that post at a platform level.

In some feeds, we prioritize popularity; in others, we prioritize the social graph. But mostly, all of them are a healthy mix of the two.
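
In code, the final sorted-set score can be as simple as a weighted blend of the two signals. The weights below are placeholders and would differ per feed.

```go
package reco

// feedScore blends relationship strength (from the social graph) with the
// platform-level popularity score. The weights are illustrative placeholders.
func feedScore(relationshipStrength, popularity float64) float64 {
	const wGraph, wPopularity = 0.7, 0.3
	return wGraph*relationshipStrength + wPopularity*popularity
}
```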

Working of the system

Recommendation working of the system

As you can see from the diagram above, the system has been intentionally kept very simple. Following is how the system works -

  1. When user A creates a post, the post service, after saving that post, triggers a pub/sub event to a queue, which is received by a background service meant for candidate generation. We use Google Pub/Sub for the pub/sub functionality (a minimal publish sketch follows this list).

  2. This background service receives the event asynchronously and performs the functionality discussed earlier - privacy checks, moderation checks and keyword generation - and then generates the vector embeddings and stores them in the vector database. We are using AstraDB as our vector database (discussed later).

  3. Whenever a user engages with a post (like/comment/share, etc.), the post service, after updating our main NoSQL database, triggers a pub/sub event to the recommendation engine service.

  4. This recommendation engine service updates the graph database and then updates the user's recommended feed in near real-time by performing the ANN search and updating the Redis database. So, the more users interact, the better the feed keeps getting. There are checks to ensure that the recommendations are not biased towards a specific list of keywords; those checks are performed while we query the graph database. This service also updates the engagement score asynchronously. Engagement scores are recalculated when users view the post as well.

  5. Since all of the above steps are performed asynchronously behind the scenes, these computations have no impact on the end-user experience.

  6. The feed is finally served to the end user through a feed service. Since this service just performs a lookup on Redis and our main NoSQL database (DynamoDB), its P99 latency is less than 110 milliseconds. Both databases return query results with single-digit millisecond latency irrespective of scale.
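
For completeness, here is a minimal Google Cloud Pub/Sub publish sketch in Go for step 1 above. The project ID, topic name and event fields are placeholders.

```go
package main

import (
	"context"
	"encoding/json"
	"log"

	"cloud.google.com/go/pubsub"
)

// PostCreatedEvent is an illustrative payload; field names are assumptions.
type PostCreatedEvent struct {
	PostID    string `json:"postId"`
	CreatorID string `json:"creatorId"`
	Text      string `json:"text"`
}

func main() {
	ctx := context.Background()

	// "my-project" and "post-created" are placeholders for your GCP project
	// and topic.
	client, err := pubsub.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	payload, _ := json.Marshal(PostCreatedEvent{PostID: "post_123", CreatorID: "user_42", Text: "hello"})

	// Publish is asynchronous; Get blocks until the server acknowledges.
	res := client.Topic("post-created").Publish(ctx, &pubsub.Message{Data: payload})
	id, err := res.Get(ctx)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("published post-created event %s", id)
}
```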

Tools and technologies used

  1. Some services have been written in the Go programming language, while others have been written in NodeJS (with TypeScript).

  2. We are using AstraDB by Datastax as our vector database. We arrived at this decision after evaluating multiple other databases, such as Pinecone, Milvus and Weaviate. Apart from its excellent query and indexing capabilities on vector and other data types, it offers a pocket-friendly serverless pricing plan. It runs on top of a Cassandra engine, which we already use as a database in several other features on our platform, and it provides a CQL query interface, which is very developer-friendly. I highly recommend trying it for your vector use cases.

  3. We use Google Pub/Sub for our asynchronous communication because, at our current scale (a few lakh total users, a few thousand daily active users), it's highly cost-effective. I have run it at a scale of a few lakh users with thousands of events per second. It works well, and it's effortless to use and extend.

  4. Redis - Speed, simplicity and powerful data structures. I don't think I need to discuss why Redis in 2024.

  5. DynamoDB - Again, it is highly scalable and easy to use, and we run it in the serverless mode where, despite hundreds of thousands of queries per minute, our total bill is quite low. It also offers very powerful indexing capabilities and single-digit millisecond latency in reads and writes.

Problems to be solved in the future

As you can imagine, this same setup can be tweaked to build a basic recommendation engine for any use case. But, since ours is a social network, we will require some tweaks down the line to make this system more efficient.

  1. Machine learning/ Deep learning algorithms will be needed at the social graph level to predict the keywords and users most relevant for the user. Currently, the data set is too small to predict anything accurately as it is a very new product. However, as the data grows, we will need to replace the current simple queries and formulas with the output of machine learning algorithms.

  2. Relationships between various keywords and users must be fine-tuned and made more granular. They are at a very high level right now, but they will need to be deeper. We will need to explore second- and third-degree relationships in our graph to refine the recommendations further.

  3. We are not doing any fine-tuning in our embedding models right now. We will need to do that in the near future.

Ending note

I hope you found this blog helpful. If you have any questions, doubts or suggestions, please feel free to contact me on Twitter, Linkedin or Instagram. Do share this article with your friends and colleagues.

How to decide if serverless is right for you?

· 10 min read
Aditya Kumar

Serverless Architecture

The fundamental idea of serverless is eliminating the need to think about servers. If you are not thinking about (or working on) managing the infrastructure, you get more time to focus on building your applications and serving your customers better. Let's understand this better by answering the question - what types of serverless offerings are available in the market?

Whenever I discuss the word serverless with developers and engineering leads, they automatically start talking about tools such as AWS Lambda, Google Cloud Run, etc. For some reason, serverless compute services have better brand awareness. This is where the problem starts: compute is just one of the use cases of serverless. Let me give you a quick overview of the serverless categories in the market.

1. Compute

In serverless computing, you will find various cloud providers' offerings. Those offerings are -

  1. Serverless functions - AWS Lambda, Google Cloud Functions, Azure functions, etc.

  2. Serverless container runtime - Google Cloud Run, AWS Fargate, etc

  3. Managed Kubernetes engines - Amazon EKS, Google Cloud GKE, Azure Kubernetes service, etc.

2. Databases

Serverless databases are all the rage nowadays. Most serverless databases offer the usual features - scalability, cost savings, etc. Still, the real fight between them is over the factor that is the ultimate nirvana for every developer: single-digit millisecond latency at any scale. That means, irrespective of what command (CRUD) you perform on your database, you will get the result in less than ten milliseconds (unless you are using anti-patterns). To summarize, you don't have to manage the servers, you don't have to worry about high availability, throughput, etc., and you still get single-digit millisecond latency. Who wouldn't go for that, right?

We will discuss what's the catch in this situation in the last section of this post, but for now, the following is the list of the most popular Serverless databases available in the market -

  1. Amazon DynamoDB (NoSQL)

  2. Amazon Aurora serverless (SQL)

  3. Azure CosmosDB (Both NoSQL and SQL options available)

  4. Azure SQL database serverless

3. Data warehouse

Data warehouse databases and engines are optimized for running analytics queries. If you don't know the difference between OLTP and OLAP databases, google it.

Google BigQuery is a serverless data warehouse - https://cloud.google.com/bigquery

Amazon has also launched Amazon Redshift serverless. Several other cloud providers are racing towards their serverless data warehouse offering.

4. REST APIs and GraphQL

Many services in the market offer serverless management and deployment of REST APIs/GraphQL endpoints. Some examples are - AWS Appsync, Amazon API Gateway, Azure API Management, Firebase cloud functions, etc.

Most of them work with one or more of the above services (computing, databases, etc) to get the results. The idea is to simplify managing and deploying these everyday use cases, such as REST APIs and GraphQL.

5. Message queues and pub-sub

I come from the era where you had to scale up your own RabbitMQ and Kafka setup; trust me, it was a nightmare. So, serverless message queues are one of my favourite cloud products. Such services include Amazon SQS, Google pub/sub, etc.

If you don't know what message queue and publisher-subscriber pattern are, go through this link

6. Media storage

Media storage services like Amazon S3 have been around for a long time. But people often forget that these services are the "OG" serverless offering. These services inspired the idea behind serverless computing and serverless databases.

Examples of such services are - Amazon S3, Cloud storage (by Google), Azure blob storage, etc.

7. Edge CDNs and networks

Classifying Edge CDNs as serverless is a bit of a stretch, as they are primarily known for only one thing - "content delivery". But they are no longer used for just content delivery, and they offer all the core benefits that serverless promises, like scalability, no operational overhead, ease of use, etc. So, technically, they also fall under the serverless paradigm.

If you don't know what CDNs and Edge networks are, you can learn using the following links -

  1. https://www.cloudflare.com/en-in/learning/cdn/glossary/edge-server/

  2. https://vercel.com/docs/concepts/edge-network/overview

Edge CDNs started with the use-case of content delivery, but now you can run serverless functions, APIs, etc, on edge as well.

Why does serverless exist? - The problem statement

If I have to summarize and oversimplify things, the following high-level reasons should be enough to answer the question -

  1. To save cost - Serverless is usually cheaper when your application traffic is inconsistent or low. A "pay only for resources used" pricing model usually charges you for memory, CPU, and other resources consumed by your app. Cost is generally marketed as the biggest reason for choosing serverless.

  2. Auto-Scaling: Serverless platforms automatically scale resources up or down based on demand. It allows applications to handle varying workloads without manual intervention.

  3. Less operational complexity - Anyone who has ever tried to scale up compute, database servers, or something as simple as a reverse proxy (such as Nginx) for highly concurrent workloads knows how complex (and ugly) operational overhead can be. Serverless aims to eliminate this operational overhead and make things simpler for developers.

  4. Improved developer experience - Boosting developer experience is the new motto of enterprises and startups. Have people in management fully understood and experienced the wrath of pissed-off developers? (Kidding, of course.) But yes, people have understood the fundamental principle that "time is money". And when paying hundreds of thousands of dollars yearly to a single developer, time is literally money. Serverless promises increased development velocity and an improved developer experience by freeing developers from infrastructure provisioning, management, ops, etc.

There are many other advantages marketed by cloud providers while selling these solutions, but almost all are subsets of the four points mentioned above.

How can you decide if the serverless solution is good for you?

We are now addressing the elephant in the room. Now, this is where people generalize an entire spectrum of products and offerings by throwing statements like "serverless is suitable for small workloads" or "it's not fit for production environments". And that is a lazy way to think about things.

The most critical metric in choosing between serverless and provisioned solutions is cost. Everything else - auto-scaling, less operational overhead, and improved developer experience - is promised by provisioned solutions as well. And cost in the cloud is generally a function of -

  1. Compute - the amount of processing power required and usually measured in terms of number of cores and types of processors.

  2. Memory - Amount of memory (RAM - if you want to call it that) required and usually measured in MiB/GiB. (or, in some cases, memory units).

  3. Storage - The storage required for the solution is measured in MB/GB/TB, etc.

  4. Data transfer cost - The most dangerous factor if not accounted for, usually measured in GB/TB, etc.

There can also be other factors when computing the cost for a specific offering. It requires you to go through the pricing page of that solution and understand the pricing logic well. But in general, the cost predominantly depends on the above four factors.

But if you think about it - The choice between provisioned and serverless (unless there is some difference in certain functionalities) can be as simple as estimating things and doing a few simple mathematical calculations. All you have to do is -

  1. Capture the engineering requirements of the system well - Use this battle-tested ERD template to structure your engineering requirements and do the capacity planning efficiently.

  2. Do Back-Of-The-Envelope Estimation / Capacity Planning to determine the system's load requirements - Choose a metric that best fits your problem. It can be anything such as the "number of concurrent requests/read/writes", "total data transferred", "Daily active users", "Monthly active users", etc. It depends highly on the problem statement you are solving and the solution you are designing for that. (Google this topic and learn it in depth if you don't know what it is).

  3. Figure out your expectations from the solution - Cost effectiveness, speed, performance, scalability, low operational overhead, high availability, security, automated backups, observability, maintainability, etc. - Although it's very rare, there may be some functionality differences between provisioned and serverless solutions offered. If your expectations are apparent from the beginning, you can avoid surprises down the line.

  4. Compare the serverless and provisioned solutions based on the results of the above three steps, and see which one fits your needs best in the short term and the relatively long term (relative to the nature of your business and the growth curve it can have). A small worked example follows this list.
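
Here is a small worked example of that comparison in Go for a hypothetical database workload. Every price constant is a placeholder - pull the real unit costs from your provider's pricing page; the value of the exercise is the arithmetic, not these figures.

```go
package main

import "fmt"

// A tiny back-of-the-envelope comparison for a database workload. All prices
// below are PLACEHOLDERS, not real provider pricing.
func main() {
	// Estimated monthly workload (from your capacity planning).
	const (
		readsPerMonth  = 250_000_000.0
		writesPerMonth = 40_000_000.0
		storageGB      = 150.0
	)

	// On-demand / serverless pricing (assumed unit costs).
	const (
		pricePerMillionReads  = 0.25 // $ per million read requests (assumed)
		pricePerMillionWrites = 1.25 // $ per million write requests (assumed)
		pricePerGBMonth       = 0.25 // $ per GB-month of storage (assumed)
	)
	serverless := readsPerMonth/1e6*pricePerMillionReads +
		writesPerMonth/1e6*pricePerMillionWrites +
		storageGB*pricePerGBMonth

	// Provisioned pricing: capacity you must keep running for peak load,
	// whether or not it is used (assumed hourly rate and node count).
	const (
		nodeHourlyRate = 0.35 // $ per node-hour (assumed)
		nodesForPeak   = 3.0
		hoursPerMonth  = 730.0
	)
	provisioned := nodeHourlyRate*nodesForPeak*hoursPerMonth + storageGB*pricePerGBMonth

	fmt.Printf("serverless estimate:  $%.2f / month\n", serverless)
	fmt.Printf("provisioned estimate: $%.2f / month\n", provisioned)
}
```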

But what about the usual problems with serverless solutions - The usual suspects

Whenever I discuss serverless vs provisioned with people, I get a list of "usual problems with serverless," which are used as an excuse not to think about the solution. So, to save time in the future, I will take a few extra minutes to address those issues in this blog post and be done with this topic. Following is the list and my reply -

  1. Cold start problem - The first and foremost reason people tout for not considering serverless is a cold start. But in 2024, there are only a few serverless solutions (for example, serverless functions) that suffer from cold start problems, and there are several ways to circumvent that problem and still be pocket-friendly. In fact, serverless solutions like DynamoDB offer a single-digit millisecond latency for all operations. So, stop generalizing and solve the problem based on the above-mentioned steps.

  2. Resource Limitations - Again, it's 2024, and most serverless offerings, including the serverless compute offerings, have very high resource limits. 95% of use cases don't need processing/memory/storage beyond what is already available.

  3. Uncertain Pricing - This is what this entire blog post is all about, so read it and understand that the pricing is only uncertain if you don't do the math.

  4. Debugging and Testing - This problem was specific to serverless functions, but there are now frameworks such as serverless framework(link) to solve this problem for serverless functions. You will not face this problem with other serverless offerings.

  5. Vendor lock-in - Another famous reason for not using serverless. I can't refute this in a single statement because it's a big discussion on its own. I will write a separate blog post about it and update the link here once it's live. But, at a high level, vendor lock-in is not as severe a problem as people think.

Ending note

I hope this post pushes you to explore serverless as an option for your cost optimization and developer experience needs. If you do the math correctly, you can reduce costs by 90% for the same amount of scale in most solutions (Especially databases, storage and queues).
If you have any questions, doubts or suggestions, please feel free to contact me on Twitter, Linkedin or Instagram. Do share this article with your friends and colleagues.