
Navigating MVP Launches as a Software Engineer: Essential Tips for Success

#Development · 20 June 2023
  • Alex Polovinkin

    Deputy Director of the Development Department

Hello everyone! My name is Alex Polovinkin, and I am Deputy Director of the Development Department at imaga. Over the past two years, I have been fortunate to launch two major MVP projects: a car classifieds service for Kazakhstan and a telemedicine project. During this time, my team and I have gained a lot of experience launching such projects, and we would like to share it. In this article, I will discuss how to avoid mistakes during the MVP stage and which practices are worth adopting right away.

Why MVP is Important

The goal of any MVP (Minimum Viable Product) is to quickly bring a solution to market, reach the maximum audience, and attract customers. The key factor in launching an MVP is therefore speed to market.
The MVP must be of high quality. On the project website, users should not see broken layouts and a million errors — otherwise, they will leave the page forever. The same applies to internal systems. A person should be able to go through all the key steps in the system without any problems:

  • Register;
  • Log in;
  • View ads or find a specialist;
  • See integration with CRM;
  • Receive SMS or push notifications, etc.


All the main user steps should work without bugs. This is especially true for the payment process. Any bug in an MVP is painful, and there should be few of them. We can tolerate some rough edges or manual workarounds in the process, but everything should work.

That's why an MVP should have a minimal number of viable features. You quickly launch a quality product with limited functionality, which allows the business to test a theory or capture a piece of the market, while also preventing you from burning out due to a 120-hour work week.

What challenges await the team


  1. Vague project requirements at the start.

    In the beginning, the client usually does not fully understand what the project will look like, not just in terms of design, but also business logic. They may not know the payment flow, what sections should be included, and what aspects of the initial scope are truly important. Almost always, nuances are lost at the start, and the final scope expands.

    The fact is that the business is also evolving. Initial ideas do not always make it to the end. They develop and change in priority. Therefore, new important features periodically arise, without which the project cannot be launched. Sometimes, the opposite happens: a feature that the team struggled with for a week suddenly becomes unnecessary.

  2. Errors in implementing integrations or object models.

    It is important to understand that the decisions you make at the start will live with the project for a long time. Changing the object model is difficult, as you do not have time to rewrite all services tied to it. The same goes for architecture. If you make a mistake, you will likely have to live with it until the work is completed.

  3. Hidden issues: network infrastructure problems, database usage not allowed per company policy, sanctions, weaker servers than required, etc.

    There can be many non-obvious problems. For example, in a project in Kazakhstan, we were surprised to learn that the local cloud provider did not have Kubernetes. We had to test and deploy it together with them, which took 2 weeks.

  4. Due to the problems above, the initial architecture of solutions often suffers.

Architecture is probably the key factor in your project: how you build the system, which databases you choose, which message broker to use, how services will communicate, and which modules will be used. A mistake at this stage significantly increases the number of hours spent on debugging and refactoring in the future.

  5. Any mistake in CORE is very costly.

    By CORE, I mean the modules or services of your system: MDM, IDM, and other abbreviations on which you build service wrappers for various needs, or key packages that integrate into all system services.

    The problems above may seem like issues for a manager, team lead, or architect. However, at this stage, the foundation for future developer problems is laid, so understanding all these circumstances is essential.

Main mistakes in MVP


  1. Incorrectly defining the scope of work.

    A recent example from practice: at the beginning of the project, we agreed with the client that a review service was not so important at the start of the project because there would not be significant traffic. Therefore, reviews could be postponed. We agreed to leave a simple form where it would be possible to rate the session and write a comment.

    Later, a second mistake occurred.

  2. Not fixing the scope of work.
    Eventually, this service expanded even before leaving the MVP stage. The ability to choose multiple feedback options was added, each with its own problem checkboxes. As a result, we first implemented a not-so-important feature for the MVP and then expanded it further.

It may seem like a quick task, but there are many small tasks like this. A developer spends a couple of hours on it, a tester spends half an hour testing it, front-end developers spend an hour and a half on fixes, and the team lead takes about as long to validate it. Suddenly, a 30-minute task consumes 4-6 hours.

    When there are 30 such tasks, you will spend weeks on what could have been done without stress and haste.

  3. Incorrectly choosing the architecture.

  4. Spending too much or too little time on specifications.

    Sometimes, a massive specification is written for simple functionality. It won't help development much but will consume a lot of time. On the other hand, sometimes no specification is written at all for simple functionality, even though it affects the logic of neighboring services. As a result, we get data inconsistency.

From this list, you can only influence the final scope and its quality. Developers are intelligent, highly skilled specialists, and their opinion should be considered. Do not hesitate to tell the team lead that half of the services can be removed from the MVP because they are not important, or that a service can be simplified: rougher, but working. After the MVP, you can return to it and refactor.

The essential thing is to ship with quality and ship quickly. Everything else can be caught up later.

Packages and Utilities

In this article, I will mainly talk about microservices, because monolithic MVPs are simpler: monoliths are easier to implement, especially with limited functionality. However, the market for large products is moving massively towards microservices. This is an understandable trend, since scaling and developing a monolith is more challenging.
At the MVP stage, your primary concept should focus on the proper separation of logic and creating a modular monolith that will be easy to disassemble into services in the future. There you will encounter the same problems that will be discussed further.
When starting an MVP on microservices, it is worth considering that they are always more expensive to develop than a monolith. This is related to the number of utilities required for implementation: communication between services, standardization of their work, working with SSO, standardization of logs, setting up general principles of communication through a broker, building unified APIs, distribution, and traffic monitoring, etc.
The second reason is the more complex infrastructure work. A monolith can even be run under systemd via systemctl. With microservices, it's more complicated: at the very least, more time is needed for debugging, launching, and testing the system.

SSO and Internal Requests

Let's take two services: the acquiring service and the shopping cart service. They need to communicate with each other. When checking out a cart on the frontend, the user sends a payment request to the acquiring service. The acquiring service needs to receive it and make sure the cart is correct — that the cost is accurate, etc. This means that an internal network request must occur from the acquiring service to the shopping cart service. The cart responds that everything is correct, and the acquiring service processes the payment.
There is one problem in this simple communication: we need to make an internal network request. To implement this, we need to consider 2 factors:

  • On whose behalf we are making the request from the acquiring service to the cart — on behalf of the user or the service.
  • Where the neighboring service is located.
To solve this problem, we can proxy the user's token or create a token for the service. By the way, we'll have to do the latter in any case.
For this, we will need to create a separate client that will extract the user's token, put it in a new request, and send it to another service. At the same time, we need to know where our shopping cart service is located. We can't hardcode it because then the system's logic might break when network settings are changed by administrators.
So, our package needs to solve 3 problems:

  1. Ensuring token proxying,
  2. Ensuring communication between services,
  3. Encapsulating service addresses.


For configuration, we need to obtain the addresses of neighboring services from the outside – from our DevOps engineers. Environment variables passed to the container are perfect for this. Inside the code, we only need to define a settings class and load our envs into it. Pydantic models are great for serializing this data.
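For example, a settings class might look like this (a minimal sketch with assumed variable names; note that in Pydantic v2, BaseSettings moved to the separate pydantic-settings package):

```python
# A minimal sketch (assumed names, not the real project code): service
# addresses come from environment variables injected into the container.
from pydantic import BaseSettings  # Pydantic v1; v2 uses pydantic-settings


class ServiceSettings(BaseSettings):
    # e.g. CART_SERVICE_URL=http://cart.internal:8000
    cart_service_url: str
    sso_service_url: str
    request_timeout: float = 5.0

    class Config:
        env_file = ".env"  # convenient for local runs; containers use real envs


settings = ServiceSettings()
```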


Proxying the user's token is a fairly simple task. You need to take the header from the request and transfer it to another one. But what if a background task is being executed, and we need to make an interservice request without the user's token? This is more complicated. You could simply configure internal networks in Kubernetes, which will only be accessible during communication between containers, but your information security specialist is unlikely to allow this.
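The proxying part itself might look like this (a minimal sketch assuming FastAPI and httpx; in practice the cart address comes from the settings class above):

```python
# A minimal token-proxy sketch: lift the Authorization header from the
# incoming request and pass it along to the neighboring service.
import httpx
from fastapi import FastAPI, Request

app = FastAPI()


@app.post("/checkout/")
async def checkout(request: Request) -> dict:
    auth = request.headers.get("Authorization")
    headers = {"Authorization": auth} if auth else {}
    async with httpx.AsyncClient() as client:
        # The address is taken from settings, never hardcoded
        response = await client.get(
            "http://cart.internal/api/v1/cart/", headers=headers
        )
    return response.json()
```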

We decided to implement this mechanism in the following way:
We defined a client for communication with the service — for example, with SSO. It inherits from the base S2S Client, which requires a set of parameters during initialization, such as the service's login and password, as well as additional settings — logger, timeouts, etc.

To optimize performance, we also decided to store the token directly in memory and work with it from there. If the token has expired or is absent from memory, the client re-authorizes the service in SSO and updates the token. We also keep the option of saving the token in a cache instead of memory.
Inside the S2S Client, the logic for obtaining a token from SSO, updating the token if it has expired, and retries if the service did not respond the first time is encapsulated. Also, this client logs all requests — both successful and with errors.
Now all that remains is to create a new client for communication with the neighboring service, inheriting from the S2S client. In it, we can extend the logic. As a result, it takes significantly less time to create a unified client for communication with any service, the communication rules remain common, the number of errors is reduced, and working with the system is simplified.
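A condensed sketch of this hierarchy (class and method names are illustrative; the real clients also add retries, timeouts, and structured logging):

```python
# Sketch of a base S2S client that authorizes in SSO, caches the token in
# memory, and re-authorizes when the token expires or is missing.
import time
from typing import Any

import httpx


class S2SClient:
    def __init__(self, base_url: str, sso_url: str, login: str, password: str):
        self.base_url = base_url
        self.sso_url = sso_url
        self.login = login
        self.password = password
        self._token: str | None = None   # kept in memory; a cache also works
        self._expires_at: float = 0.0

    def _get_token(self) -> str:
        if self._token is None or time.time() >= self._expires_at:
            response = httpx.post(
                f"{self.sso_url}/token/",
                data={"login": self.login, "password": self.password},
            )
            response.raise_for_status()
            payload = response.json()
            self._token = payload["access_token"]
            self._expires_at = time.time() + payload["expires_in"]
        return self._token

    def request(self, method: str, path: str, **kwargs: Any) -> httpx.Response:
        headers = kwargs.pop("headers", {})
        headers["Authorization"] = f"Bearer {self._get_token()}"
        return httpx.request(
            method, f"{self.base_url}{path}", headers=headers, **kwargs
        )


class CartClient(S2SClient):
    """Client for the shopping cart service; extends the base logic."""

    def validate_cart(self, cart_id: str) -> httpx.Response:
        return self.request("GET", f"/api/v1/carts/{cart_id}/validate/")
```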

Logging

Another question arises here — what if something goes wrong? In this case, we need a log.
But we should not simply set up a standard logger. In microservices, it is good to have the ability to view request tracing from entry to exit:

  • through which services the request has passed,
  • how many requests were made to neighboring services,
  • how long they took to execute, etc.

Unlike a monolithic Django application, it is not easy for us to find out where the longest execution time was, since there is interservice interaction.
But let's get back to the logs. This is an effective way to understand where an error was made and with what data we received it. Logs are a separate and large topic that can be discussed for a long time: how to set up monitoring, Prometheus, etc. But in this article, I will only present the basic "axioms" of logging.
Logs should be stored in a separate system that collects them from the services' stdout — systems like ELK, Graylog, etc. All of them share one common feature: they index logs. For convenient and quick log search, the log format must be uniform. Otherwise, you will have to transform logs inside that system, which is expensive.
In other words, when one service outputs an XML log, another one outputs JSON, and a third one outputs plain text without specific formatting, the best option will always be to record logs in JSON and output JSON strings in stdout. In this case, it will be easy to save, index, and search through them. Often, a log looks something like this:
"Start integration. Time %%%%"
"Integration finished. Time %%%%"
"Integration error: %errors"
Writing such debug messages is not bad in itself, but they lack metadata. From this log, we can only see that some integration started at time N and ended at time X. What happened between these checkpoints is unclear.
That's why we log all request parameters and response results, and always indicate where the request was going. Logs are also divided into levels: Info, Error, etc. Debug logs, by the way, are also quite useful.

The same principle should be used for regular logs that you implement. Replace the standard logger so that it can collect more metadata and convert them into JSON strings. Without logs, debugging the system will be difficult.
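With the standard logging module, such a JSON formatter might look like this (a sketch; the metadata field names are illustrative):

```python
# Sketch of a JSON formatter on top of the standard logging module: every
# record becomes a single JSON string on stdout, ready for ELK/Graylog.
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    METADATA = ("url", "method", "status_code", "duration_ms", "trace_id")

    def format(self, record: logging.LogRecord) -> str:
        log = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Metadata passed via extra={...} shows up as record attributes
        for key in self.METADATA:
            if hasattr(record, key):
                log[key] = getattr(record, key)
        return json.dumps(log)


handler = logging.StreamHandler(sys.stdout)  # collectors read from stdout
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("s2s").info(
    "request finished",
    extra={"url": "http://cart.internal/api/v1/cart/", "status_code": 200, "duration_ms": 42},
)
```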

In fact, logs are always written in our systems for all interservice interactions. This is also implemented in a separate client, which is the basic component of our SDKs.

If you look inside the __init__ of the S2S client, you can see that several clients are created during initialization: one for communication with SSO and one for the service itself. Both inherit from BaseClient.

BaseClient, in turn, is responsible for the standard behavior of requests: SSL certificates, retry policy, timeouts, and so on. This is another abstraction over requests that allows for unified communication between services. And another important task it solves is logging messages.

I described one of the ways you can wrap your requests. With each request, a record is made of where and what request was sent. All errors are also logged with exhaustive information about the data used in the request. Both the request and the response are logged.

Then, all this is output to stdout and stored in ELK, and in the future, this log is used for debugging and monitoring.

Auto-tests

We instill a love for auto-tests in all our employees. At imaga, there are almost no Python projects without tests.

In the Django world, there is the pytest-django library. It provides convenient tooling, primarily for working with databases. Since Django has its own ORM, pytest-django integrates deeply with it and provides various handy database tools out of the box, including when running tests with xdist. Frameworks like FastAPI, Flask, and aiohttp do not dictate tools or architecture, and expect you to choose the building blocks of your application yourself. Unlike with Django, writing your own test framework takes considerable time. We built ours with pytest-django in mind, so we had to implement the following functionality:

  • creating a test database;
  • setting up migrations;
  • correct transaction handling for rolling back changes in each test;
  • cloning the database for each worker when running with Xdist;
  • various convenient JWT token generators with necessary permissions/roles.
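A condensed sketch of the database fixtures (synchronous SQLAlchemy for brevity; the DSN, migration hook, and database creation are assumptions):

```python
# conftest.py sketch: per-xdist-worker database, migrations once per session,
# and a rollback transaction around every test.
import os

import pytest
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

BASE_DSN = "postgresql://app:app@localhost:5432/app_test"


@pytest.fixture(scope="session")
def db_url() -> str:
    # Clone the database per xdist worker: app_test_gw0, app_test_gw1, ...
    worker = os.environ.get("PYTEST_XDIST_WORKER", "gw0")
    return f"{BASE_DSN}_{worker}"


@pytest.fixture(scope="session")
def engine(db_url):
    engine = create_engine(db_url)
    # create_database(db_url); apply_migrations(engine)  # e.g. alembic upgrade head
    yield engine
    engine.dispose()


@pytest.fixture()
def db_session(engine):
    # Each test runs inside a transaction that is rolled back afterwards,
    # so tests never see each other's data.
    connection = engine.connect()
    transaction = connection.begin()
    session = sessionmaker(bind=connection)()
    yield session
    session.close()
    transaction.rollback()
    connection.close()
```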


Creating a good test framework with all the necessary fixtures and behavior takes time. But later, it gives a strong boost to writing tests. Moreover, it is an excellent way to teach the team TDD. Without a good framework, TDD is difficult; with a well-made one, even junior-level specialists can create tests by looking at the existing ones. Another non-obvious advantage is that debugging is significantly simplified, since reproducing a problem in a test is easier than reproducing it manually.

Key points we adhere to when writing tests in a microservice architecture:

  1. Unit tests for key functions and classes of the system.

    These are needed primarily for confidence that unnecessary changes won't break all services.

  2. Tests for API endpoints.

    This choice is primarily due to the fact that testing the API allows us to test as many application layers as possible. Then, if finances and time permit, each layer can be tested separately.

  3. Deterministic data.

    When testing API endpoints, we try to make all input data required for processing user requests as deterministic as possible.

  4. Comparing with reference results.

    As a result of the previous point, ideally, we should compare the HTTP status and response text completely, one-to-one. If the tested endpoints start returning more or fewer fields or discrepancies in the format of some fields are found, the tests will immediately signal this to us.

  5. Mocks.

Write a lot of mocks for the responses of various services. It's better to mock service responses rather than classes/functions, so that you also test how your clients work.

    For example, our standard test looks like this: it loads the main fixtures, where all external requests are mocked. After that, we simulate a request to create a contract with specific data. We check the response status and compare the received data with the expected contract.
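Sketched in code, such a test might look like this (respx mocks httpx traffic at the transport level; `client`, `contract_payload`, and `expected_contract` are assumed fixtures, and the URLs are illustrative):

```python
# Sketch of a standard endpoint test: external HTTP calls are mocked, then
# the full response is compared with a reference, field for field.
import httpx
import respx


@respx.mock
def test_create_contract(client, contract_payload, expected_contract):
    # Mock the neighboring services' responses, not our classes/functions
    respx.post("http://sso.internal/api/token/").mock(
        return_value=httpx.Response(
            200, json={"access_token": "test", "expires_in": 300}
        )
    )
    respx.get("http://cart.internal/api/v1/carts/1/").mock(
        return_value=httpx.Response(200, json={"id": 1, "total": "100.00"})
    )

    response = client.post("/api/v1/contracts/", json=contract_payload)

    assert response.status_code == 201
    assert response.json() == expected_contract  # one-to-one comparison
```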
There is also a set of universal tests that we copy-paste from service to service:

  • tests for Healthcheck endpoints;
  • tests for checking endpoints with different debug information;
  • tests for endpoints that always "fail" to check Sentry;
  • stairway migration tests (an idea borrowed from Alexander Vasin of Yandex): apply one migration, roll back, apply two migrations, roll back, etc.;
  • tests for checking the correctness of permissions at endpoints.
There are also custom modules for Django/FastAPI that can introspect the application's internals and extract all existing endpoints and HTTP methods. They check that every service endpoint, except for a list of specific URLs, returns the corresponding error:

  • when requesting without a JWT token;
  • when requesting with an incorrect authorization header format;
  • when requesting with an incorrect JWT token (expired, mismatched signature);
  • in the absence of access rights.

The last point is the most important, as it helps us understand where we forgot to apply the necessary permission or check rights correctly. Such tests are especially good at mitigating human error: they quickly show a new person on the team that they built something but did not apply the correct rights check.

In this test, all_routes is a function that returns a list of tuples like ('url', 'method'). Moreover, it generates URLs with correct path_params: for example, for a URL like "/payments/{payment_id:uuid}/", it generates a random UUID. Even if an object with this ID does not exist in the system, it's not that important. What matters is that it is a correct existing URL for the framework, and we won't get a 404 error.
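A sketch of what such an introspection helper and test might look like for FastAPI (`PUBLIC_URLS` and the parameter-filling logic are illustrative):

```python
# Sketch of endpoint introspection and a blanket auth test (FastAPI assumed;
# in real tests, import the service's actual application).
import uuid

from fastapi import FastAPI
from fastapi.testclient import TestClient

app = FastAPI()
client = TestClient(app)
PUBLIC_URLS = {"/health/", "/docs", "/docs/oauth2-redirect", "/redoc", "/openapi.json"}


def all_routes(app: FastAPI) -> list[tuple[str, str]]:
    """Collect (url, method) pairs, filling path params with plausible values."""
    pairs = []
    for route in app.routes:
        path = getattr(route, "path", "")
        while "{" in path:  # "/payments/{payment_id:uuid}/" -> a random UUID
            start, end = path.index("{"), path.index("}")
            path = path[:start] + str(uuid.uuid4()) + path[end + 1:]
        for method in getattr(route, "methods", set()) - {"HEAD", "OPTIONS"}:
            pairs.append((path, method))
    return pairs


def test_all_endpoints_require_auth():
    bad_headers = ({}, {"Authorization": "bogus"}, {"Authorization": "Bearer expired"})
    for url, method in all_routes(app):
        if url in PUBLIC_URLS:
            continue
        # No token, malformed header, and an invalid token must all be rejected
        for headers in bad_headers:
            response = client.request(method, url, headers=headers)
            assert response.status_code in (401, 403), (url, method, headers)
```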

Asynchronous tasks

For inter-service asynchronous communication, we used RabbitMQ. At a high level, both Celery and Dramatiq were suitable for background task processing in our case. Initially, we settled on Celery; later, we noticed Dramatiq and worked with it for some time.
The main differences between Dramatiq and Celery are:
  • Dramatiq works under Windows;
  • Middleware can be created for Dramatiq;
  • Subjectively, Dramatiq's source code is more understandable than Celery's;
  • Dramatiq supports reloading when the code changes.
However, in operation on our stack, these tools did not perform very well. First of all, neither natively supports asyncio. This is a problem when all your code is written for asynchronous operation and you need to run it from synchronous code.
Of course, you can run it, but we started catching various elusive bugs when working with the database, rare transaction issues, phantom closing connections, etc. Moreover, it turned out that it was not very easy to attach logs of the required format to Celery. It was also not easy to set up proper error alerting in Sentry according to business logic. We do not want to send business exceptions to Sentry, only unexpected ones from Python. Plus, the constructs for running asynchronous code from synchronous looked terrible.

Considering all this, our task code was monstrous, full of hacks and hard-to-debug bugs. Therefore, we wrote our own producer-consumer implementation based on the aio-pika library. The hacks for running asynchronous code disappeared, and we gained the ability to add our own middleware, now at the worker level.

And since we can now natively work with Python's asynchrony, our worker can now process not just one task but several at once. It looks more or less the same as it would have been in Celery or Dramatiq.
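A minimal sketch of such a consumer on aio-pika (not our production code; queue names and the task handler are illustrative):

```python
# Sketch of an asyncio-native worker: prefetch_count lets one worker keep
# several tasks in flight, and each handler runs inside the event loop.
import asyncio

import aio_pika


async def process_task(body: bytes) -> None:
    ...  # your async business logic


async def main() -> None:
    connection = await aio_pika.connect_robust("amqp://guest:guest@rabbitmq/")
    channel = await connection.channel()
    await channel.set_qos(prefetch_count=10)  # several tasks at once
    queue = await channel.declare_queue("tasks", durable=True)

    async def handle(message: aio_pika.abc.AbstractIncomingMessage) -> None:
        async with message.process():  # acks on success, rejects on exception
            await process_task(message.body)

    await queue.consume(handle)
    await asyncio.Future()  # run forever


if __name__ == "__main__":
    asyncio.run(main())
```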
Moreover, during development, we slightly revised the approach to communication between services through the broker. At the beginning of MVP system development, producer services knew about consumer services and explicitly sent events to each other's queues. At first, this worked well, but as we grew, we realized that producer services knew too much about consumer services and took over part of the business logic when deciding whether to send a message or not.
Therefore, we switched to another approach: producer services' events became simple broadcasts, meaning producers no longer knew about their consumers, and consumer services themselves decided whether to subscribe to these events and, according to business requirements, checked whether to process the event. Since there is only one event, and there can be as many triggers as you like, we also wrote simple job classes of this kind:
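A sketch of what such a job class looks like (names are illustrative):

```python
# Each job decides for itself whether an incoming broker event concerns it.
from abc import ABC, abstractmethod
from typing import Any


class BaseJob(ABC):
    @abstractmethod
    def is_triggered(self, event: dict[str, Any]) -> bool:
        """Return True if this job should run for the given event."""

    @abstractmethod
    async def process(self, event: dict[str, Any]) -> None:
        """The business logic of the job."""


class SendReceiptJob(BaseJob):
    def is_triggered(self, event: dict[str, Any]) -> bool:
        return event.get("type") == "payment.succeeded"

    async def process(self, event: dict[str, Any]) -> None:
        ...  # e.g. call the notification service
```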

Here, is_triggered is a method that returns True/False depending on whether it should be triggered on this event or not. The process method is the business logic of this job.
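Continuing the sketch, the executor side might look like this:

```python
# The executor filters jobs via is_triggered and runs the matching ones
# concurrently.
import asyncio
from typing import Any


async def execute_jobs(jobs: list[BaseJob], event: dict[str, Any]) -> None:
    triggered = [job for job in jobs if job.is_triggered(event)]
    await asyncio.gather(*(job.process(event) for job in triggered))
```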

Next, the list of jobs is passed to a special executor, which checks whether each job should run using the is_triggered method and launches the required ones. Perhaps this is not the most elegant solution, but we were able to describe the job-triggering rules clearly, without a pile of if statements, while running jobs asynchronously and easily adding or removing them as the business requires. It also became easier for developers to understand the code structure and maintain the logic.

Recommendations

It is impossible to cover all the issues that need to be addressed in an MVP in a single article. But here are a few points that are definitely worth paying special attention to.
Security checks.
There are always many issues with them; you need to constantly scan the code to prevent critical vulnerabilities from appearing.

Load testing.
It is essential to conduct it. It is difficult to predict in advance how much load your system can withstand. Often, problems arise after exceeding a certain RPS.

S3 storage.
In monolithic architectures, you can always manage files in a separate package. Most often, you check the file size, allowed extensions, whether the file is executable, and limits on the number of uploads. You also need to check whether a file has expired, clean the storage on time, etc.

In microservices, this package becomes a separate microservice, which means you will have to aggregate all the file-processing logic in it. This incurs overhead: we need to know which service a file was uploaded from, along with its metadata, and the file has to be uploaded directly to the S3 service. Thus, a microservice whose business logic requires an uploaded file (photos, docx, xlsx, etc.) knows nothing about the file itself and holds only its metadata.

Consequently, we will need a separate asynchronous data synchronization procedure: informing the microservice where the file is located (its URL), the identifier of the file, whether everything is okay with it, etc.

Be sure to use auto-generated documentation.
FastAPI comes with it out of the box. For Django, there is drf-yasg. Ready-made auto-docs save a lot of time for frontend and mobile developers.

Do not neglect typing.
An excellent way to ensure that you are using packages and classes correctly and not making mistakes.

Write automated tests.
Where there is a lot of communication, they are indispensable. For you, this will be protection from extra bugs and an excellent tool for development.

Ask colleagues for help if you are stuck on a question.
It's not shameful; it's necessary. This way, you will complete the task faster, not miss deadlines, and not engage in self-flagellation. Brainstorming is an excellent practice; use it.

Set up Sentry.
It is a simple and powerful tool that is easily set up in standalone mode. Sentry is easy to configure in any framework. Implementing it in a project takes no more than 30 minutes.

Lock library versions.

By default, our projects use Poetry. In it, as in other dependency managers, you can specify only a minimum library version. But there's a catch: usually only the minimum required version gets specified, and that is bad. The more libraries you have, especially popular ones, the higher the chance of catching a package conflict or an unexpected breaking update, so lock exact versions.

If you have any questions, feel free to ask in the comments.
