Understanding The Product Lifecycle

DoubtNut's upload fix pushed daily user loss up by 10 percentage points. Its auto-play feature drained students' daily data allowance in minutes. Three failures, one cause -- building with the wrong picture of the user.

June 7, 2026 · 10 min read · 1 / 4

The previous post set up the problem: a significant share of DoubtNut's daily users were failing to upload their question photos because of poor network quality. Fix this before anything else.

What I did not tell you is what happened when we actually tried to fix it.

The First Instinct: Compress the Image

The obvious approach is compression. If the image is too large to upload on a slow connection, make it smaller.

That works for photos of people and landscapes. It does not work for photos of textbook questions.

A DoubtNut photo is dense with text -- printed equations, diagrams, handwritten numbers. Compress that image aggressively and you blur the fine strokes that distinguish one character from another. The OCR system reading the question now receives a blurrier input. A smudged "2" becomes a "7". A fraction becomes unreadable. The system maps the degraded text to the wrong video.

The user gets a solution to a different problem.

From their perspective, the app answered the wrong question. They do not know the image was compressed. They do not know OCR was involved at all. They think the app is broken.

They uninstall.

Compression trades one failure mode for a worse one. An upload failure is at least honest -- the user knows something went wrong. A wrong answer destroys credibility silently. The user leaves convinced the product does not work, not that their network was slow.

DoubtNut's upload routing: fast networks send to backend OCR pipeline; slow networks trigger on-device OCR, which hung 512MB-RAM devices and forced an emergency rollback

The Real Solution: Know the Network

The solution we eventually built started with one observation: the quality problem only existed on poor networks.

On a fast connection -- 4G or good Wi-Fi -- sending the full image to the backend server was always the right approach. Our backend had a purpose-built OCR pipeline we had spent months tuning. It produced the most accurate question extraction possible. There was no reason to compromise on it when the network could handle the payload.

The weak network case was the only one that needed a different path.

Here, the full image upload would time out or fail entirely. The fallback: extract the text on the device itself, and send only the text to the backend. Text is orders of magnitude smaller than an image. Even on 2G, a text payload goes through.
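
To make the size difference concrete, here is a rough back-of-the-envelope sketch. The payload sizes and link speed are illustrative assumptions, not DoubtNut measurements: a 2 MB photo, 500 bytes of extracted text, and a weak 2G link that sustains about 40 kbit/s. The function name is hypothetical, and the calculation ignores latency, retries, and protocol overhead.

```python
def upload_seconds(payload_bytes: int, link_kbps: float) -> float:
    """Ideal transfer time, ignoring latency, retries, and protocol overhead."""
    bits = payload_bytes * 8
    return bits / (link_kbps * 1000)

# Illustrative assumptions, not measured values.
IMAGE_BYTES = 2_000_000   # a ~2 MB question photo
TEXT_BYTES = 500          # the same question as extracted text
LINK_KBPS = 40            # a weak 2G connection

image_time = upload_seconds(IMAGE_BYTES, LINK_KBPS)
text_time = upload_seconds(TEXT_BYTES, LINK_KBPS)

print(f"image: {image_time:.0f} s, text: {text_time:.1f} s")
# prints "image: 400 s, text: 0.1 s"
```

Under these assumptions the image needs several minutes of uninterrupted transfer, long enough that a timeout is the likely outcome, while the text goes through in a fraction of a second.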

Google's Firebase ML Kit provided the tool -- an on-device OCR library that could read text from an image without making a network call. We used lazy loading: the library was not bundled in the initial APK. After the user installed the app, the library downloaded silently in the background. By the time a user hit a poor-network session, the SDK was already on their device.

The routing logic became:

  • Detect network quality at the moment of upload
  • If strong: send full image to backend pipeline
  • If weak: run on-device OCR, extract text, send text only
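
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code we shipped: the names and the 500 kbit/s cutoff are hypothetical, and what counts as a "strong" network would need tuning against real upload data.

```python
from enum import Enum

class UploadPath(Enum):
    FULL_IMAGE = "send full image to backend OCR pipeline"
    TEXT_ONLY = "run on-device OCR and send extracted text"

# Hypothetical cutoff between "strong" and "weak" networks.
STRONG_NETWORK_KBPS = 500

def choose_upload_path(measured_kbps: float) -> UploadPath:
    """Pick the upload path from the bandwidth measured at the moment of upload."""
    if measured_kbps >= STRONG_NETWORK_KBPS:
        return UploadPath.FULL_IMAGE
    return UploadPath.TEXT_ONLY

print(choose_upload_path(4000).name)  # FULL_IMAGE
print(choose_upload_path(40).name)    # TEXT_ONLY
```

Note that the decision depends only on the network. Nothing in it looks at the device the code is running on.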

We ran this experiment for two to three months. On good networks, the backend pipeline was more accurate. On poor networks, on-device extraction was good enough to return a usable answer and keep the user in the product.

It felt like we had solved it.

What We Got Wrong

We had not tested on the devices our users actually owned.

DoubtNut's users -- students in Tier 3 and Tier 4 cities, on 2G and 3G connections -- were not carrying flagship Android phones. They had budget devices. Many of them had 512MB of RAM.

When a device with 512MB of RAM lazy-loaded the Firebase ML SDK and tried to run on-device OCR, it hit a wall. The SDK consumed most of the device's available memory. Other processes were starved.

The phone started hanging.

From the user's perspective: they took a photo, waited, and the app froze.

We had to roll back entirely and push an emergency update.

The lesson is not that client-side ML is wrong. It is that every architectural decision carries hidden assumptions about the device it runs on. We designed the fallback with a mental model of a phone that did not match the phone in our user's hand.

This was 2017-2018. Mobile ML libraries were young. Device fragmentation was severe.

On-device AI tooling was nowhere near what it is today. Those constraints shaped every decision.

The right question to ask before implementing any client-side processing: what is the lowest-specification device this will run on, and have I actually tested on it? If the answer is "I am not sure," that is already the answer.

The Number That Made It Real

Before the fix, we were losing 30% of new users every day -- the ones who could not upload their question and left immediately. That number was painful, but at least the cause was obvious.

After the fix, we started losing 40% of users.

The solution had made things measurably worse. Users whose devices hung during the ML loading phase did not think "my phone has too little RAM." They thought "this app is buggy" or that their phone had picked up a virus. They uninstalled. The problem had not moved -- it had grown.

This is what happens when you build without knowing your user's device. Not a theoretical risk. A 10-percentage-point increase in daily user loss.

Know Your User's Device Before You Write the First Line

The way to avoid this is not clever architecture. It is data.

Tools like Firebase Analytics, Mixpanel, and Amplitude track which devices are accessing your product, what operating system versions they run, and what network conditions they are on. This data existed for DoubtNut. We were not looking at it carefully enough before we shipped.

If we had mapped our user base by device before implementing client-side ML, we would have seen the 512MB RAM cohort immediately. We would have known that the lazy-loading fallback could not run on the majority of devices it was designed to help.

Before you design a solution, describe your user's device. Then go find one. Then test on it.
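
As a rough sketch of what "mapping the user base by device" can look like, the snippet below buckets users by RAM tier and reports each tier's share. The rows are made-up sample data, and the field names and tier boundaries are hypothetical. It also assumes RAM is available per user, which in practice may mean joining an analytics export of device models against a separate spec table.

```python
from collections import Counter

# Hypothetical rows, shaped like a per-user export joined with device specs.
users = [
    {"user_id": 1, "ram_mb": 512},
    {"user_id": 2, "ram_mb": 512},
    {"user_id": 3, "ram_mb": 1024},
    {"user_id": 4, "ram_mb": 2048},
    {"user_id": 5, "ram_mb": 512},
]

def ram_bucket(ram_mb: int) -> str:
    """Group devices into coarse RAM tiers."""
    if ram_mb <= 512:
        return "<=512MB"
    if ram_mb <= 1024:
        return "<=1GB"
    return ">1GB"

def cohort_shares(rows: list[dict]) -> dict[str, float]:
    """Return each RAM tier's share of the user base."""
    counts = Counter(ram_bucket(r["ram_mb"]) for r in rows)
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()}

# Largest cohort first.
for bucket, share in sorted(cohort_shares(users).items(), key=lambda kv: -kv[1]):
    print(f"{bucket}: {share:.0%}")
```

If the lowest tier turns out to be the largest cohort, that is the device to go and find before any client-side processing ships.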

WhatsApp's Lesson on Constraint Engineering

DoubtNut is not the only product that had to learn this the hard way. WhatsApp did too -- at a scale that makes DoubtNut's problem look contained.

During its scaling phase in India, WhatsApp had approximately 400 million users in the country -- a number that has since grown to over 500 million monthly active users. A significant portion of those users connect on 3G or slower networks, on devices that cost less than about Rs. 7,000-7,500 -- budget Android handsets with limited processing power and small storage.

The team had to solve a question that sounds simple and is actually one of the hardest problems in mobile engineering: how do you make a messaging app feel instant on a 3G connection?

If a message you send sits with a spinning indicator for three seconds before confirming delivery, you start to wonder whether it went through. You send it again. The other person receives it twice.

For hundreds of millions of people in India, where WhatsApp was trying to become the default communication tool, that uncertainty would have killed adoption.

WhatsApp is the dominant messaging app in India not because it was the first or the prettiest. It is dominant because it was the one that kept working reliably when the network did not.

YouTube built the same moat in video. In 2017, YouTube's video initiation time -- the gap between tapping play and the first frame appearing -- was under 0.5 seconds on a 3G network. A video on Google Drive over the same connection stalled to buffer almost immediately.

That engineering difference is one of the main reasons YouTube became the default video platform globally while Drive remains a storage tool.

The Second Wall: The Video That Would Not Play

After solving -- or attempting to solve -- the upload problem, we assumed the hard part was done. Upload the question. Get the right video. Done.

Then students started telling us the answer was not playing.

The videos on DoubtNut were recorded in 720p HD. On a strong connection, they streamed fine. On 3G, tapping play and waiting 10 seconds was normal.

Students would wait, see nothing happen, and press back.

From their perspective, the app had found the answer and refused to show it to them. The OCR had worked. The right video had been matched.

Delivering the right answer and failing to show it is functionally the same as giving the wrong answer.

The Auto-Play Rollback

The instinct, when video initiation time is too slow, is to start loading the video before the user asks for it. If the first result is pre-buffered by the time the user reaches the solution page, initiation time drops to near zero.

We shipped this. We had to roll back the same day.

The problem was data.

In 2017-2018, most of our users were on prepaid mobile data plans. The standard recharge was around Rs. 258 for one month of data with a 1GB daily limit. When a student landed on a solution page and four or five videos began auto-loading in the background, their daily data allowance evaporated in minutes.
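
A rough calculation shows how quickly that happens. The figures below are illustrative assumptions rather than DoubtNut's actual numbers: a 720p encode at about 1.5 Mbit/s, five-minute solution videos, five videos pre-loaded per solution page, and every pre-loaded video downloading in full.

```python
def video_megabytes(bitrate_mbps: float, minutes: float) -> float:
    """Size of a video in megabytes: Mbit/s times seconds, divided by 8 bits per byte."""
    return bitrate_mbps * minutes * 60 / 8

# Illustrative assumptions, not DoubtNut's actual figures.
BITRATE_MBPS = 1.5      # a plausible 720p encode
VIDEO_MINUTES = 5       # length of one solution video
VIDEOS_PER_PAGE = 5     # videos auto-loaded per solution page
DAILY_CAP_MB = 1000     # the 1GB daily limit

per_page_mb = video_megabytes(BITRATE_MBPS, VIDEO_MINUTES) * VIDEOS_PER_PAGE
pages_until_cap = DAILY_CAP_MB / per_page_mb

print(f"{per_page_mb:.0f} MB per solution page")
print(f"daily cap exhausted after about {pages_until_cap:.1f} pages")
```

Under these assumptions one solution page costs roughly 281 MB, so fewer than four questions would use up the whole day's data, whether or not the student watched a single video.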

Users who ran out of data could not come back. They could not ask more questions. One-star reviews came in immediately -- not about the videos, but about data usage.

A feature designed to improve the experience had reduced the number of sessions per user per day.

A feature that works perfectly for one user can be actively harmful for another. Auto-play is a normal, expected behaviour on Netflix and YouTube because their users are largely on home Wi-Fi or unlimited data plans. For a student in a Tier 3 city on a 1GB daily prepaid limit, auto-play was an expense they had not consented to.

The mistake was borrowing a pattern from products built for a different user and assuming it would transfer cleanly. It did not.

We eventually partnered with Airtel to offer students a subsidised data plan, but that was a business deal, not a product solution. The product solution had to come from understanding what streaming looks like when data is genuinely scarce -- which is what we later learned from YouTube's team directly through workshops on buffering reduction and adaptive bitrate streaming.
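
For readers unfamiliar with adaptive bitrate streaming, the core idea can be sketched in a few lines: the player measures recent throughput and picks the highest-quality rendition that fits inside it. This is a minimal throughput-based heuristic, not YouTube's algorithm and not what DoubtNut deployed. The bitrate ladder, the safety factor, and the function name are all hypothetical.

```python
# Hypothetical bitrate ladder: (label, kbit/s), lowest first.
LADDER = [("144p", 100), ("240p", 300), ("360p", 700), ("480p", 1200), ("720p", 2500)]

def pick_rendition(throughput_kbps: float, safety: float = 0.8) -> str:
    """Return the highest rendition that fits within a safety margin of throughput."""
    budget = throughput_kbps * safety
    choice = LADDER[0][0]  # always fall back to the lowest rung
    for label, kbps in LADDER:
        if kbps <= budget:
            choice = label
    return choice

print(pick_rendition(400))   # 240p
print(pick_rendition(5000))  # 720p
print(pick_rendition(50))    # 144p
```

A real player re-runs this choice every few seconds as throughput changes, so a student on 3G gets a lower-resolution video that starts quickly instead of a 720p stream that never plays.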

What All of This Actually Teaches

Three separate failures. One cause.

The compression failure, the device hanging failure, and the auto-play data failure all trace back to the same root: building with a mental model of the user that did not match the user who actually showed up.

The student asking a maths question at 10pm on a 3G connection, on a phone with 512MB of RAM and a 1GB daily data cap, is not the user most developers picture when they are writing code. But that was our user. Every architectural decision that did not account for that person -- whether on image compression, client-side ML, or video pre-loading -- created a failure we had to fix under pressure.

Product management, at its core, is the practice of forcing that picture to be accurate before the code is written. Not after the rollback.

The Essentials

  1. The wrong user model destroys products. Three separate failures at DoubtNut -- compression, device hanging, data drain -- all traced to building with a picture of the user that did not match reality. The fix was always the same: know the actual device, the actual network, the actual data plan.

  2. Compression trades one failure for a worse one. An upload failure is honest -- the user knows something went wrong. A wrong answer destroys trust silently. The user leaves convinced the product is broken, not that the network was slow.

  3. Test on the lowest-spec device in your user base before shipping any client-side processing. Not on dev machines. Not on the founder's phone. On the actual device. That ten-minute check would have prevented a multi-week rollback.

Further Reading and Watching

All three failures in this post -- compression, device hanging, data drain -- share the same root. The next post introduces the framework that explains why: the four stages of a product's life, and how each stage changes what you should be building and how carefully you should be building it.