Today we'd like to share our experience with a rather interesting project that we have been building and evolving throughout this year: a corporate ERP system.
Situation at the Start
We at IMAGA develop a wide variety of projects, from rigid, slow-moving corporate systems to rapidly evolving digital products. There are long-term waterfall corporate portals, and there are lively, fast-changing products. Work on some projects spans years, while others wrap up in a few months. Sometimes we find ourselves building a car marketplace in three months or putting together a Telegram bot over a couple of evenings.
At the pre-sales stage, we strive to approach each project with a high level of detail. We always try to involve technical specialists in the initial meetings to iron out all nuances even before the start of the analysis.
This time we encountered a fairly rare situation: the deadlines were not unrealistic. It's unwise to rush the pre-project research phase when creating an ERP system, and we had ample time to dig into the requirements and constraints and to design every detail alongside the system analysts.
From the very outset it was clear that this would be challenging. Essentially, we faced the task of implementing a major ERP system from scratch. The client had neither well-established processes (which meant the ERP needed to be highly adaptable to change) nor a final scope of tasks.
Most of the ERPs we had encountered were built on a monolithic architecture. This time, we decided to propose a SOA (Service Oriented Architecture) to the client. And here's why:
An ERP system requires a meticulous approach to processes within the company where it is being implemented. Its main task is to manage resources and optimize their utilization.
Our client is in the leasing services business, and their company is part of a large corporation. The goals are quite ambitious, extending even to expansion into European markets. KPIs had been established, so we had to prepare not just for increased load but for serious integrations and large, complex calculations on the system's side.
And of course, the choice of architecture and technologies was critical.
The first thing to say: service architecture is not a miraculous panacea or salvation from all troubles. Before choosing it, make sure you're not just chasing the hype around the word "micro".
So why did we choose SOA?
Unadjusted Processes at the Client’s Side
Just to clarify, our main stack is Python. Almost the entire backend is written in it. But that's beside the point.
Our client is a relatively young company, which can be referred to as a kind of startup within a large holding. At the time of our acquaintance, their legal entity wasn't even a year old. The core team had already been formed, but it was necessary to work through all the processes at all levels of service. This required time and reconfiguration of processes on the client's side.
For us, this meant that we needed to approach the changing requirements with maximum flexibility and carry out systemic analysis extremely meticulously.
As far as analysis is concerned, we were not afraid of any tasks, but changing requirements could significantly complicate the development of a monolithic system.
The architecture was built around an event model: data was passed between system components in a defined format, and events could be raised from any service. There was no rigid mapping of which service receives which data; an event created in system A could immediately go to B, C, and D.
We built financial accounting in the system the same way. Only one service handled the integration with accounting, and every service that needed to pass financial information created an event, which asynchronously generated a corresponding event in the financial service.
This approach to architecture allows for adding or changing any services in the system without any complexities. The tools interact with each other seamlessly and integrate easily into the overall financial system.
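To make the fan-out pattern concrete, here is a minimal in-process sketch in Python. In production this role is played by a message broker (RabbitMQ, in our case); the `EventBus` class, event type, and payload shape are illustrative assumptions, not our actual code.

```python
from collections import defaultdict
from typing import Any, Callable

class EventBus:
    """In-memory stand-in for a broker: fan-out delivery to all subscribers."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict[str, Any]) -> None:
        # An event created in service A goes to every subscribed service.
        for handler in self._subscribers[event_type]:
            handler(payload)

bus = EventBus()
received: dict[str, dict] = {}

# Services B, C and D all subscribe to the same event type; none of them
# is hard-wired to the producer.
for name in ("billing", "crm", "finance"):
    bus.subscribe("contract.created", lambda p, n=name: received.update({n: p}))

bus.publish("contract.created", {"contract_id": 42, "amount": 1000})
```

Adding a new consumer is a one-line `subscribe` call; the publishing service does not change at all, which is exactly what makes swapping services in and out painless.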
Integration in monolithic systems is a fairly routine matter, but it comes with its own set of challenges, mostly related to the fact that we sometimes have to integrate with different versions of libraries. For instance, one CRM service communicated over the OData v4 protocol, while another was on OData v2. In a monolith this would have posed difficulties: we would have needed to deploy two different versions of containers with different sets of packages. And what if the language packages start conflicting with each other? Debugging and development become more complicated, and the number of dependencies in the system grows.
SOA simplifies this issue — we can aggregate all the data serialization logic and work with a specific protocol within a service. Only the necessary API will be available to external consumers.
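As an illustration of aggregating protocol logic inside one service, here is a hedged sketch: the two client classes are stand-ins for real OData v2/v4 libraries, and `normalize` collapses both envelope formats into the single internal shape that other services consume.

```python
class ODataV2Client:
    """Stand-in for an OData v2 library: responses use the {"d": {"results": [...]}} envelope."""

    def fetch(self, entity: str) -> dict:
        return {"d": {"results": [{"Name": entity}]}}

class ODataV4Client:
    """Stand-in for an OData v4 library: responses use the {"value": [...]} envelope."""

    def fetch(self, entity: str) -> dict:
        return {"value": [{"Name": entity}]}

def normalize(raw: dict) -> list[dict]:
    """Collapse both envelope formats into one internal shape."""
    if "d" in raw:                      # v2-style envelope
        return raw["d"]["results"]
    return raw["value"]                 # v4-style envelope

# Internal consumers see one API regardless of the upstream protocol version.
rows = [normalize(client.fetch("Client")) for client in (ODataV2Client(), ODataV4Client())]
```

The point of keeping this adapter inside a single service is that the rest of the system never learns which protocol version the upstream CRM speaks.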
Security also comes into play here. We cannot allow all services to have external internet access. So, we simply deployed a service, granted it internet access, aggregated logic for working with an external service within it, and designated an API for internal services.
For the sake of security, a firewall was set up for this service, more stringent rules for static code analysis were written, and more sensitive monitoring was put in place.
We are not just talking about the technical scaling of server resources. Everyone understands that in SOA, it's much easier to manage traffic and deploy more containers for the service experiencing degradation. But the aspect of team scaling is also crucial.
A service architecture implies clear boundaries for the logic of each service. This, in turn, lowers the barrier to entry and makes it easy to bring new developers up to speed.
For example: it is much easier to understand how a geoservice with a small set of APIs, a couple of asynchronous methods, and 5-10 database models is organized, than to find a similar module in a monolith and understand how and where it is used in other modules.
If services are wisely broken down, their development can be easily parallelized, and documentation simplified. Describing the logic of a small service is easier. Its logic is strictly defined, and accidental inclusion of some components from other modules is much less likely, with much less interconnectivity.
In our practice we have had to do awful things: copying entire methods, slightly changing their names, and adding a few extra lines of code. We did this because the methods and classes in question were used in a large number of APIs and asynchronous tasks, and any intervention could lead to unforeseen results in one logical operation or another. Yes, there is a remedy for this, though not a 100% effective one: automated tests. But we all know the world we live in; few are willing to spend 8 hours writing tests for the sake of adding one logical condition to a method. There are other ways too, but they are quite costly in implementation time. The technical-debt tickets and TODOs in the code are, of course, still waiting for their hero; for now we'll deal with the small stuff and definitely get to them, yes…
With a service-oriented approach, this problem is minimized. You either create a common service, which works the same for everyone and provides a versioned API, or implement a common package, the necessary version of which is used by this or that service.
What Are the Advantages of ERP Services?
What has implementing SOA in our ERP given us? A lot of headaches, probably, more than anything else. But there was also significant profit. SOA looks very attractive, but it requires a meticulous approach to the details: each item on the "why SOA" list comes with its own set of issues and complexities. However, the positive effect overwhelmingly outweighs them.
Among the main advantages we gained is flexibility in shaping processes. Our system is ready to change as quickly as the company's business processes do. In the first six months of development we managed to rewrite several services twice, as business processes or the very concept of a service changed before it went into production. Apart from the user-interface stage, the rework took significantly less time than developing the same service in a monolith would have: we were decoupled from the other system components, so there were no collisions or intersections with other modules.
The system turned out to be quite extensive yet at the same time succinct. We are still developing and improving it, but it's already evident that understanding a specific service is much easier than grasping what's happening across the entire ERP. We even assigned juniors to some of these services, and they are handling their tasks quite successfully.
We were able to try out several new frameworks, testing new solutions in specific small and isolated services. This opportunity allowed us to test them in production and understand where the pitfalls lie, where the path on StackOverflow isn’t well-trodden yet, and where bugs have not been fixed yet. Thanks to the compactness and isolation, we managed to work around bugs and "jury-rig" some solutions. Such experimentation allowed us to use these frameworks with confidence in more serious services.
The Problems We Encountered
The most important and problematic component. You won't survive without a strong architect on the team. Architecture decides everything: any small oversight at the architecture stage can lead to days of rework and corrections, not of one service but of a whole array of them.
Everything is important: the protocol of interaction, the object model in each service, the distribution of responsibilities among services, the choice of databases, and much more. Everything must be documented, described, and illustrated. It’s crucial to stumble as softly as possible at this stage (there are no perfect solutions).
The cost of mistakes at different stages of development:
One of the challenges turned out to be building the habit of covering all code not only with tests but also with documentation. Living without specifications in a system like this is quite hard: after a while we noticed that we simply could not recall how this or that component was implemented. This problem is familiar to everyone who designs service architectures. Without good documentation, the system cannot maintain integrity and encapsulation.
Developers also sometimes struggle to quickly recall why this class or method is implemented this way and not otherwise. Since the system directly implements business processes, avoiding specific quirks in the code is impossible. Why, for instance, does this method have a threshold value after which the logic drastically changes? Question... And such questions arise at every step.
Many of them were solved by the documentation process. We describe the basic schemas of the services, draw flow diagrams, describe the status and object models, and provide a brief annotation for each service. In the code, we spell out what each test verifies and comment on questionable spots. Naturally, every service auto-generates a Swagger (OpenAPI) specification and serves it at its own URL, protected by authorization and access privileges.
The screenshots are deliberately blurred due to the NDA. But the essence should be clear.
The project team is expanding quite quickly, and turnover and replacements are inevitable. Where initially we spent about half a day explaining to a new developer why the system is needed, what each service does, and how they communicate, now it's enough to explain only the business side and send the developer off to read the documentation and study the code. Of course, the documentation won't give a 100% understanding of the project, and questions will remain, but it gets a developer to roughly 80%.
With infrastructure, it's not as simple as it seems. It's important to remember that any incorrectly configured API can lead to problems. For instance, you expect one service to transmit reference information via API for display, and the service suddenly fails for some reason. In this case, errors should not occur; the system should be able to handle such situations. Services should be as independent as possible and should always work, even if some crucial component of the system stops responding.
This requires special manipulations with the API, plus a more meticulous approach to the data serialization stage.
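A minimal sketch of such graceful degradation, assuming a hypothetical reference-data endpoint and a simple in-memory cache (both are illustrative, not our production code): if the upstream service is unreachable, the last cached copy is returned instead of an error.

```python
import json
from urllib import error, request

# Illustrative cache with a pre-seeded fallback value.
_cache: dict[str, list[str]] = {"regions": ["default"]}

def fetch_reference_data(url: str, key: str, timeout: float = 0.5) -> list[str]:
    """Return reference data, or the last cached copy if the service is down."""
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            data = json.loads(resp.read())
        _cache[key] = data              # refresh the cache on success
        return data
    except (error.URLError, TimeoutError, ValueError):
        return _cache.get(key, [])     # degrade gracefully, never raise

# A downed service must not break the page; the hostname here is
# deliberately unresolvable to simulate an outage.
regions = fetch_reference_data("http://unreachable.invalid/regions", "regions")
```

The same idea extends to timeouts and malformed payloads: the consumer always gets something displayable, and the outage surfaces in monitoring rather than in the user's browser.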
The same applies to data synchronization between services. If data has changed somewhere in the system, it's likely to require changes in another service that uses it. For this purpose, we use the RabbitMQ message broker. In the future, if the system expands, we plan to bring in Kafka. Kafka is a more powerful broker that supports colossal throughput, but it demands more rigorous support, as its configuration and operation are much more complex than RabbitMQ's. In our case, the volume of messages and integrations hasn't yet reached a level that justifies adding it to the infrastructure. We decided to reduce development cost, being fully confident that the RabbitMQ cluster will handle the upcoming loads. We wouldn't recommend reaching for Kafka right away unless there are solid arguments in its favor.
Furthermore, a simplified infrastructure on Docker or Docker Compose will not suit a service architecture: you won't achieve enough flexibility and versatility. You need to set up an orchestrator (Kubernetes or an analogue) and maintain a dedicated DevOps team, since backend and frontend developers lack the DevOps specialization to provide an adequate level of support.
As the system expands, it becomes increasingly difficult to understand which interface is starting to degrade. You have to extend monitoring, keeping an eye on each service and on the traffic passing through the infrastructure nodes. For this, additional systems such as Prometheus, Zabbix, and ELK have to be connected. You need to set threshold values, make sure there are no delays in API requests between systems, and be sure messages are not stuck in the broker, as this affects the system's availability. In our case, we check that requests between services do not exceed a certain response-time limit. Each service has its own limit, but none should exceed 300 ms. The value is empirical: until the system reaches full capacity, we don't do additional optimization, hence the threshold is fairly generous.
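The threshold check can be sketched as a simple timing decorator. In reality these measurements feed Prometheus/Zabbix dashboards and alerts; the names and the tightened demo limit below are illustrative assumptions.

```python
import time
from functools import wraps

# Calls that exceed their latency budget are recorded here; in production
# this would be a metric pushed to a monitoring system, not a list.
SLOW_CALLS: list[tuple[str, float]] = []

def track_latency(limit_ms: float = 300.0):
    """Record any call of the wrapped function that exceeds limit_ms."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                if elapsed_ms > limit_ms:
                    SLOW_CALLS.append((func.__name__, elapsed_ms))
        return wrapper
    return decorator

@track_latency(limit_ms=50)       # tightened limit so the demo triggers
def slow_handler():
    time.sleep(0.1)               # simulate a degraded inter-service call
    return "ok"

slow_handler()
```

Per-service limits then become one decorator argument, which keeps the empirical 300 ms budget visible right at the handler it applies to.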
Each service has a soft-delete mechanism, ensuring that no object in the services will be physically removed from the database.
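A minimal illustration of the soft-delete idea, using a plain in-memory repository. In our Django services this is typically a nullable timestamp field plus a manager that filters the queryset; the class and field names here are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Contract:
    id: int
    deleted_at: Optional[datetime] = None   # None means the row is "alive"

class ContractRepo:
    """Stand-in for a database table with soft-delete semantics."""

    def __init__(self) -> None:
        self._rows: dict[int, Contract] = {}

    def add(self, contract: Contract) -> None:
        self._rows[contract.id] = contract

    def delete(self, contract_id: int) -> None:
        # Soft delete: the row stays in storage, only the flag changes.
        self._rows[contract_id].deleted_at = datetime.now(timezone.utc)

    def all_active(self) -> list[Contract]:
        return [c for c in self._rows.values() if c.deleted_at is None]

repo = ContractRepo()
repo.add(Contract(1))
repo.add(Contract(2))
repo.delete(1)          # hidden from consumers, but never physically removed
```

This keeps the audit trail intact: "deleted" objects disappear from every API while remaining recoverable in the database.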
We also monitor the volume of network traffic to predict its degradation. All requests in the system are monitored, changes on SSO are checked, all file modifications are logged, as well as object changes on each service. And of course, standard parameters like disk read/write, database load, CPU, etc., are also monitored.
All of this requires additional resources for system configuration, log collection, infographic alignment, and monitoring of individual critical nodes.
Services communicate with each other continuously, and it's necessary to ensure security and role segregation so that one service doesn't have uncontrolled access to another. The classic solution of issuing an authorization "cookie" and carrying it in every request is quite inconvenient: every primary request requires at least one additional call to the authorization service to check the token. On top of that, user data constantly has to be loaded into interfaces and access rights to resources verified.
To optimize this aspect, we implemented an SSO service based on JWT tokens. Let's briefly discuss how they work.
A JWT has three main blocks: header, payload, and signature. The header describes how the token's signature should be computed, the payload carries the useful data, and the signature is computed over the header and payload with a secret key. Thanks to this cryptographic signature, a JWT remains verifiable throughout its lifespan, guaranteeing that the data in the payload hasn't been tampered with.
The JWT itself appears as follows:
The example is taken from the official website.
This allows us to save resources — by placing useful information inside the payload. In this case, we won't have to constantly request user privileges from the SSO service, and additionally, we can store user metadata (for instance, full name) directly in the token.
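To show the mechanics, here is a hand-rolled HS256 sketch using only the standard library; a real service should use a vetted library such as PyJWT. The claim names and secret are illustrative.

```python
import base64
import hashlib
import hmac
import json

def _b64url(data: bytes) -> str:
    """Base64url without padding, as the JWT format requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_jwt(payload: dict, secret: str) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    # header.payload, each block JSON-encoded then base64url-encoded
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, payload)
    )
    sig = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(sig)}"

def verify_jwt(token: str, secret: str) -> dict:
    signing_input, _, sig = token.rpartition(".")
    expected = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(_b64url(expected), sig):
        raise ValueError("signature mismatch: token was tampered with")
    payload_b64 = signing_input.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)   # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Roles and the full name live right in the payload, so a service can
# authorize a request without an extra round-trip to the SSO service.
token = make_jwt({"sub": "42", "name": "J. Doe", "roles": ["manager"]}, "secret")
claims = verify_jwt(token, "secret")
```

The trade-off to remember: anything in the payload is readable by the holder (it is signed, not encrypted), so only non-sensitive metadata belongs there.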
Services must operate quickly. Unlike in a monolithic architecture, services constantly communicate with each other, which means a lot of network calls. This sets certain boundaries: using synchronous frameworks becomes quite challenging.
For example, if you have a service that holds a lot of meta-information (foreign keys), assembling the object entirely would require data from three or four services. In this case, synchronous frameworks would operate quite slowly.
Each additional request leads to additional waiting, and if the requests are not delegated to separate threads, the waiting time on the frontend will be immense.
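The difference can be sketched with `asyncio.gather`: three simulated service calls of 100 ms each complete in roughly the time of one when overlapped. The `fetch` coroutine stands in for a real HTTP call (httpx or aiohttp in practice), and the service names are illustrative.

```python
import asyncio
import time

async def fetch(service: str, delay: float) -> dict:
    """Stand-in for an HTTP call to another service; sleep simulates latency."""
    await asyncio.sleep(delay)
    return {service: f"data from {service}"}

async def assemble_object() -> dict:
    # Sequential awaits would take ~0.3 s; gather overlaps the waits.
    parts = await asyncio.gather(
        fetch("clients", 0.1),
        fetch("contracts", 0.1),
        fetch("geo", 0.1),
    )
    merged: dict = {}
    for part in parts:
        merged.update(part)
    return merged

start = time.perf_counter()
result = asyncio.run(assemble_object())
elapsed = time.perf_counter() - start   # ~0.1 s, not ~0.3 s
```

With a synchronous framework the three calls would queue up one after another, and the frontend would feel every one of them.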
Due to the amount of traffic, infrastructure issues are as significant as the implementation of the services themselves. As we already mentioned, traffic between services is substantially higher than in a monolithic implementation. You may hit bandwidth limits, and network delays can significantly affect the operation of some services. The cost of traffic is also considerable, especially if your system spans multiple data centers.
On this project we haven't run into it yet, but the problem is already looming on the horizon, so we won't dwell on it here: we haven't yet found an interesting solution worth sharing. Perhaps later we will publish a separate article about it.
Overall, with the chosen approach, security plays a more crucial role. The system becomes more distributed, and more points emerge where a leak might occur. Our company has established a regulation for developing such products, and adhering to it is a crucial condition of the system's development, with every commit checked by a static analyzer. This adds expense during development, but it's a mandatory condition for building complex corporate products.
The aforementioned requirements pertain to SSO, API, frontend interfaces, infrastructure, as well as file storage operations. It's essential to remember that any file stored in the system must be checked by antivirus and should reside in an isolated storage that cannot be accessed through any service.
The core requirements, by no means applied ubiquitously in the industry, include static-analyzer scans, checks for a large number of classic vulnerabilities, regulations for granting database access, and a ban on root users on servers. The full list is quite extensive.
Developing a service architecture for an ERP system is an extremely interesting task that plunges the team into a whirlpool of challenges, yet it serves as a powerful booster for professional development. We managed to build a quite versatile system that, above all, satisfies the client, and of which we are rightfully proud.
We continue to evolve, establishing new rules and approaches to development. Already, we have several SDKs that help standardize service development. We've identified a technology stack that we aim to work with moving forward and implemented common patterns to follow.
Currently, we have implemented over 15 services, five SDK libraries, and three framework scaffolds with common modules (Django, aiohttp, FastAPI).
We are developing services without breaking them down into ultra-compact microservices, as there's no need for this at the moment. The main rule for us is to aggregate business logic within each service.
We used Django in the early stages because we needed a quick admin panel for object management. Soon, work will begin on a custom admin interface common to the entire ERP, so the days of the current one are numbered. Django is a wonderful framework, but it's much slower than the asynchronous FastAPI.
All new services are now being developed on FastAPI, as we liked it more than aiohttp.
Our plans include implementing business analytics services, setting up auxiliary integration microservices and data serialization, and much more.
Such is this complex, lengthy, multi-step, and nuance-filled work. Have you had similar projects? What problems did you encounter? What was the most challenging? Tell us in the comments.