Evolution of high-concurrency IO system architecture
NOTE

All sensitive information has been removed. Because some of the corporate technologies involved are not open source, only non-confidential portions are described here as examples within the business context.

Business Background#

The enterprise platform (the parent system) aims to provide an open environment for competition and learning for individuals and students engaged in artificial intelligence research, promoting the practical application of AI technology and innovation. The platform supports various competition formats, covering topics such as machine learning, deep learning, and natural language processing. Participants can submit models and code online, and the system automatically evaluates results and generates real-time feedback. The platform also includes data management and archiving functions, supporting efficient storage of both active and historical data to handle the challenges of a large user base and growing data volumes. Through a layered storage architecture (e.g., a combination of MongoDB and MySQL), the platform optimizes performance while ensuring data security and long-term accessibility, providing reliable technical support for competition activities.

The cloud resource management system, as a subsystem of the parent system, is responsible for scheduling IO operations between the parent and child systems and dynamically allocating and managing computing resources for participants. It ensures efficient and stable resource support through features like auto-scaling, resource monitoring, security control, and cost accounting, helping participants complete model training and testing in high-performance environments.

Legacy Architecture Evolution#

As the number of participants in each competition increased, the platform’s leaderboard-processing performance began to decline. To address the bottleneck quickly, vertical hardware scaling and horizontal KVM cluster scaling were prioritized to relieve the load on individual machines. Over time, the platform continued exploring better architectural patterns to replace these stopgap scaling measures.

Vertical Hardware Scaling#

In the legacy architecture, the overall system was assessed as a medium-load application with long-lived objects and infrequent object creation and destruction. The JVM parameters were therefore set as follows (a sketch of the resulting launch command appears after the list):

  • -Xms (initial heap size): 512 MB
  • -Xmx (maximum heap size): 4 GB
    • Avoid triggering heap-space OOM during leaderboard processing.
  • -Xmn (young generation size): 2 GB
    • Prevent allocation failures for newly created objects during leaderboard processing.
  • -XX:-UseLargePages (disable large pages)
    • Avoid excessive memory contention during leaderboard processing.
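
Put together, these settings correspond to a launch command along the following lines; this is only a sketch, the JAR name is a placeholder, and any other flags the service used are omitted:

```
java -Xms512m -Xmx4g -Xmn2g -XX:-UseLargePages -jar leaderboard-service.jar
```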

When scaling hardware, vertical scaling was an important direction. By improving the compute, storage, and network capabilities of individual nodes, the system’s performance and reliability requirements could be met more efficiently. Early on, the platform prioritized upgrading hardware configurations: adding CPU cores, increasing memory capacity and frequency, and selecting higher-performance NVMe SSD storage devices. These upgrades significantly improved the servers’ processing power, enabling them to handle more complex model-training tasks, store larger datasets, and respond quickly to user requests. On the network layer, optimizing NIC performance and network bandwidth improved data transfer rates and reduced task data transfer latency. For example, fitting 10 GbE or faster network cards and optimizing the network topology ensured efficient data flow, which was crucial for real-time data uploads and result returns during competition tasks.

During hardware upgrades, hardware utilization needed to be considered: the improvement in hardware performance should match the system’s workload to avoid the waste caused by over-provisioning. For highly concurrent tasks, CPUs with strong multi-threaded performance were more appropriate; for tasks dealing with massive amounts of data, large memory and fast storage were critical. Hardware selection and expansion therefore needed to align with the platform’s actual business characteristics and workload types. Vertical scaling also had to be evaluated for return on investment: single-node hardware cannot be scaled indefinitely, and as bottlenecks emerge, costs can grow steeply while performance gains level off. A reasonable expansion boundary had to be determined to avoid pursuing raw hardware performance without regard to cost efficiency. Finally, vertical scaling required operating-system and software-stack optimization, such as tuning kernel parameters and using efficient scheduling algorithms, to fully exploit the hardware.

Vertical scaling is not a standalone means of performance improvement; in practice it needs to be combined with horizontal scaling. As the platform grows, a distributed architecture and load balancing help spread the load across multiple nodes, which not only maintains performance but also improves fault tolerance and reliability. Vertical scaling is thus both a comprehensive upgrade of hardware performance and a deep optimization of the platform’s architecture. With a reasonable scaling strategy, the platform can meet competition task demands efficiently with the hardware resources available, while laying a solid foundation for future growth.

Horizontal Scaling Clusters#

Horizontal scaling expands system computation and storage capacity by increasing the number of servers. In the competition platform, the horizontal scaling solution mainly involves: deploying more physical or virtual servers to distribute tasks across multiple nodes; using load balancers to distribute tasks and avoid single-point bottlenecks; and employing distributed storage and computation frameworks to enhance overall processing capacity and data handling efficiency. This approach quickly increases the system’s overall capacity, adapting to growing competition demands and user numbers.

The short-term advantage of horizontal scaling is that it significantly improves the system’s concurrency capacity and fault tolerance. In high-concurrency competition scenarios, new nodes can quickly relieve pressure and keep the service stable. Horizontal scaling also allows elastic scaling, dynamically adjusting resource allocation to task load variations, which improves resource utilization and reduces operating costs. Moreover, the distributed design enhances availability: if individual nodes fail, the overall service continues to run.

Legacy Issues#

Despite the benefits of vertical and horizontal scaling, their ability to fundamentally solve resource management issues is limited:

  1. Memory Usage of Competition Resources: As competition tasks and user numbers grow, the amount of historical data and models continues to increase. Even with hardware performance upgrades or more nodes, memory and storage usage problems persist. The long-term accumulation of these resources gradually erodes the system’s processing power.
  2. Program Execution Overhead: Tasks like model training and data preprocessing have high computational complexity. In scenarios with many concurrent users, system efficiency suffers. Relying solely on the central cluster to process all tasks leads to thread exhaustion, resource contention, and other issues that degrade performance.
  3. IO and CPU Task Competition: The central cluster handles both large IO operations and CPU-intensive tasks simultaneously, which can cause thread pool saturation, task delays, and even system bottlenecks. These problems cannot be solved by simply scaling hardware or adding nodes, as they stem from fundamental issues in resource allocation and task scheduling.

To address these challenges, the competition platform needs to adopt KVM-based local program design, offloading some computation and IO tasks to virtual machines. This resource separation reduces the central cluster’s load. Virtual machines run independently while sending results back to the central system with low latency, making task division more efficient. This design not only optimizes resource usage but also provides flexibility for long-term system expansion.

Exploring New Architecture#

Current Situation Analysis#

The root cause of these issues lies in the central cluster’s monolithic architecture, uneven resource allocation, and insufficient task scheduling capacity. A new architecture, centered on a layered distributed design, introduces independently running edge clusters and a short-connection mechanism, which effectively relieves the load on the central cluster. Task offloading and resource optimization also allow the system to handle concurrent tasks more efficiently while reducing the risk of single points of failure. These designs resolve the contention between storage and computation resources and lay the foundation for the platform’s scalability.

Growing Memory Usage#

As competition tasks and user numbers grow, the storage requirements for historical data and models keep increasing. This leads to significant memory consumption, affecting the execution efficiency of other real-time tasks. In the traditional architecture, the central cluster handles both real-time tasks and the storage and maintenance of historical data, creating resource contention that gradually becomes a performance bottleneck. To optimize resource use, historical data can be stored in tiers at the edge: infrequently accessed data is migrated to the edge clusters to ease the memory pressure on the central cluster. Edge nodes can also offer distributed caching and preprocessing to further improve performance and reduce user access latency.

Program Execution Overhead#

Tasks like model training and data preprocessing are computationally intensive. When the number of concurrent users increases, the system often runs out of threads, mainly because the central cluster coordinates many tasks with limited thread-pool resources and scheduling capacity. To address this, computationally intensive tasks can be offloaded to edge nodes, whose flexibility for multi-threaded, out-of-order reads reduces the scheduling burden on the central cluster. Furthermore, by dynamically adjusting thread allocation based on task priority, edge nodes can achieve better concurrency and resource utilization, as the sketch below illustrates.
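
As an illustration of priority-driven thread allocation on an edge node, here is a minimal Java sketch (not the platform’s actual code); the `PrioritizedTask` type, the pool size, and the convention that a lower number means higher priority are assumptions made for the example.

```java
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// A runnable carrying a priority so the pending queue can order backlogged work.
class PrioritizedTask implements Runnable, Comparable<PrioritizedTask> {
    final int priority;      // lower value = more urgent (assumed convention)
    final Runnable work;

    PrioritizedTask(int priority, Runnable work) {
        this.priority = priority;
        this.work = work;
    }

    @Override public void run() { work.run(); }

    @Override public int compareTo(PrioritizedTask other) {
        return Integer.compare(this.priority, other.priority);
    }
}

public class EdgeTaskPool {
    public static void main(String[] args) {
        // Worker pool whose backlog is ordered by task priority rather than arrival order.
        // Priorities only matter for queued tasks, i.e. once all core threads are busy.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 4, 60, TimeUnit.SECONDS, new PriorityBlockingQueue<>());

        pool.execute(new PrioritizedTask(5, () -> System.out.println("bulk preprocessing")));
        pool.execute(new PrioritizedTask(1, () -> System.out.println("urgent leaderboard IO")));

        pool.shutdown();
    }
}
```

Note that execute() is used rather than submit(): the priority queue must hold the comparable task objects directly, not wrapped futures.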

IO and CPU Task Competition#

The central cluster needs to handle both IO-intensive and CPU-intensive tasks at the same time, and their resource demands compete with each other. This competition can lead to thread-pool saturation, task delays, and even system crashes. For example, when the central cluster is responding to large-scale user requests while performing high-frequency model updates, IO delays can slow down overall task progress. To resolve this, a short-connection mechanism can optimize communication between the central and edge clusters, reducing IO overhead. Edge nodes can also handle some tasks independently, allowing the central cluster to focus on high-priority computation and critical data integration and avoiding performance degradation due to resource contention.

Proposed Draft#

The new architecture aims to alleviate the performance bottlenecks caused by the growing IO task pressure on the central cluster. By distributing tasks in a layered, distributed manner, separating the responsibilities of the central cluster and edge nodes, it improves the overall performance and stability of the system. The proposed draft includes:

  1. Edge Processing of IO Tasks: Offload most IO-intensive tasks to virtual machines (KVMs). These VMs run as independent nodes, handling assigned IO tasks and returning results through a dedicated communication mechanism. This design reduces the direct IO load on the central cluster, allowing it to focus on coordination and data integration tasks.
  2. Decentralized Management of Edge Clusters: Edge nodes function as a distributed cluster, independently handling task reception, execution, result storage, and reporting. This decentralized approach improves system scalability, reduces the risk of single-point failure, and optimizes resource utilization.
  3. Efficient Task Scheduling and Resource Allocation: Task scheduling and resource allocation should be dynamically managed between the central cluster and edge nodes. By leveraging KVMs and virtualized resources, tasks can be assigned more efficiently based on resource availability, reducing task delays and improving concurrency (a simplified scheduling sketch follows this list).
  4. Central Cluster and Edge Nodes Short-Connection Communication: Using a short-connection mechanism between the central and edge clusters minimizes the IO overhead of task reporting and reduces latency between nodes. This approach reduces resource contention and improves task throughput.
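
As a rough illustration of the load-aware distribution described in item 3, the Java sketch below simply picks the least-loaded healthy edge node; the `EdgeNode` record, its fields, and the scoring formula are simplified assumptions rather than the production scheduler.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Minimal view of what an edge node might report when it registers or heartbeats.
record EdgeNode(String id, int queueLength, double cpuLoad, boolean healthy) {}

class TaskScheduler {
    // Pick the healthy node with the smallest combined backlog and CPU load.
    Optional<EdgeNode> selectNode(List<EdgeNode> registry) {
        return registry.stream()
                .filter(EdgeNode::healthy)
                .min(Comparator.comparingDouble(
                        (EdgeNode n) -> n.queueLength() + n.cpuLoad() * 10));
    }
}
```

A real scheduler would also weigh supported task types and task priority, as described in the implementation section below.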

Expected Results:

  • Central Cluster Load Optimization: Significantly reduces the IO task processing pressure on the central cluster, freeing up more resources for core computation and coordination tasks.
  • Efficient Use of Edge Nodes: Through decentralized management and multi-thread optimization, edge nodes can efficiently complete tasks and quickly respond to changes in demand.
  • Improved Communication Performance: The short-connection design reduces network congestion and timeout issues, enabling more efficient data exchange.
  • Enhanced System Scalability: The distributed design of edge nodes provides greater flexibility and adaptability for future system expansion and the addition of new task types.

Implementation#

To implement edge-based IO task processing and the result callback mechanism effectively, a complete implementation can be built up step by step. This involves building edge services on a lightweight framework such as Quarkus and creating an efficient task scheduling and processing system in collaboration with the central cluster.

  1. Deployment and Functionality of Edge Node Services: Each edge cluster node deploys a lightweight service based on Quarkus. These services are primarily responsible for the following tasks:
    1. Task Subscription and Listening: The central cluster interacts with edge nodes through a publish-subscribe model. Edge nodes register with the central cluster upon startup, indicating their ability to accept specified types of IO tasks. Registration information includes node ID, supported task types, current load status, etc.
    2. IO Task Processing Logic: Upon receiving an IO task event from the central cluster, the edge service executes the corresponding operations based on the task content. For example, handling file uploads, log analysis, or model data read/write. Quarkus’s event-driven model helps efficiently process these asynchronous tasks, avoiding thread blocking.
    3. Result Packaging and Callback Interface: After task completion, the edge service packages the results into a standardized format, for example including the task ID, execution status (success/failure), and a summary of the processed data. The packaged result is sent back asynchronously to the central cluster via a short-connection interface (a sketch of such an edge service follows this list).
  2. Task Scheduling and Callback Management of the Central Cluster: The core task of the central cluster is to coordinate the work of multiple edge nodes. Its main functions include:
    1. Task Distribution: The central cluster distributes IO tasks to appropriate edge nodes based on business needs and the load status of edge nodes. The distribution uses a scheduling algorithm, considering factors like the node’s task queue length, resource availability, and task priority, to achieve load balancing.
    2. Task Queue Management: To avoid task delays or loss when edge nodes are busy after distribution, the central cluster maintains a status queue for each task. Tasks are monitored until completion, and tasks whose callbacks do not arrive in time can be rescheduled to other idle nodes.
    3. Result Reception and Processing: When edge nodes submit processed results via the callback interface, the central cluster receives and updates the task status. For successful tasks, the central cluster archives the results or passes them to subsequent business processing modules. For failed tasks, the central cluster decides whether to retry or log the failure for later analysis, based on predefined policies.
  3. Implementation Details of the Publish-Subscribe Model
    1. Event Publishing Mechanism: The central cluster uses a message queue or event streaming system (e.g., Kafka, RabbitMQ) to publish IO tasks to the edge cluster. The task information includes event ID, task content, target node, etc. Edge nodes subscribe to the relevant task types and pull tasks from the message queue.
    2. Dynamic Registration of Edge Nodes: Upon startup, edge nodes send a registration request to the central cluster to indicate their availability. The central cluster maintains a dynamic registry that updates the health status and load information of each node in real-time, which is used as the basis for task distribution decisions.
  4. Asynchronous Callback Mechanism: To avoid the resource waste of blocking while waiting on edge tasks, the system uses an asynchronous callback mechanism:
    1. Short-Connection Callback Design: After completing a task, an edge node uses an HTTP short connection to call the central cluster’s result reception interface. The interface follows the idempotency principle, so repeated callbacks caused by network interruptions or similar issues do not affect the central cluster’s task processing logic (a minimal receiver sketch appears at the end of this section).
    2. Result Validation and Archiving: After receiving the callback result, the central cluster verifies the data integrity and task status. If validation passes, the results are stored in the database or distributed storage, and the task is marked as complete. Failed callback results trigger exception handling mechanisms, such as logging or notifying the administrator.
  5. Decentralization and Collaborative Optimization of IO Tasks: Through the above mechanisms, the system decentralizes IO task processing. The edge cluster handles a large number of lower-priority IO tasks, freeing the central cluster to focus on higher-priority computational tasks. Collaboration between the central and edge clusters is achieved through asynchronous task distribution and result callbacks, preventing resource waste while improving the system’s overall response speed and processing efficiency.
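
To make the edge side more concrete, here is a rough Quarkus-flavored sketch of such a service, using the SmallRye reactive-messaging Kafka connector for task subscription and a plain JDK HTTP client for the short-connection callback. The channel name, payload format, callback URL, and helper methods are assumptions for illustration, not the platform’s actual code, and the `jakarta.*` packages assume a recent Quarkus version.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import jakarta.enterprise.context.ApplicationScoped;
import org.eclipse.microprofile.reactive.messaging.Incoming;
import io.smallrye.reactive.messaging.annotations.Blocking;

@ApplicationScoped
public class EdgeIoTaskConsumer {

    // Reused client; each callback is still a short, one-off HTTP exchange.
    private final HttpClient http = HttpClient.newHttpClient();

    // Subscribes to the "io-tasks" channel (mapped to a Kafka topic in configuration).
    @Incoming("io-tasks")
    @Blocking   // run on a worker thread, since the IO work and the callback block
    public void onTask(String taskJson) {
        String taskId = extractTaskId(taskJson);   // hypothetical helper: parse the task ID
        boolean ok = processIoTask(taskJson);      // e.g. file handling, log analysis, model data IO

        // Package a minimal result and report it back over a short HTTP connection.
        String result = "{\"taskId\":\"" + taskId + "\",\"status\":\""
                + (ok ? "SUCCESS" : "FAILED") + "\"}";
        HttpRequest callback = HttpRequest.newBuilder()
                .uri(URI.create("http://central-cluster/api/task-results"))   // assumed endpoint
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(result))
                .build();
        try {
            http.send(callback, HttpResponse.BodyHandlers.discarding());
        } catch (Exception e) {
            // In real code: retry or buffer the callback; here the failure is only logged.
            System.err.println("Callback failed for task " + taskId + ": " + e.getMessage());
        }
    }

    private String extractTaskId(String taskJson) { return "unknown"; }  // parse JSON in real code

    private boolean processIoTask(String taskJson) { return true; }      // perform the assigned IO work
}
```

With this setup, the "io-tasks" channel would typically be bound to a concrete Kafka topic and consumer group through `mp.messaging.incoming.io-tasks.*` entries in application.properties.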

This architecture reduces the burden on the central cluster, fully utilizes the computational power of edge nodes, and achieves efficient scheduling and processing of IO tasks.
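
On the central-cluster side, the result-reception interface can be kept idempotent by recording which task IDs have already been handled. The JAX-RS sketch below illustrates the idea; the endpoint path, the payload shape, and the in-memory registry are assumptions for illustration (a real deployment would persist task state in a database).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import jakarta.ws.rs.Consumes;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.core.MediaType;
import jakarta.ws.rs.core.Response;

@Path("/api/task-results")
public class TaskResultResource {

    // In-memory idempotency registry keyed by task ID.
    private static final Map<String, String> completedTasks = new ConcurrentHashMap<>();

    // Assumed shape of the edge node's callback payload.
    public static class TaskResult {
        public String taskId;
        public String status;   // "SUCCESS" or "FAILED"
    }

    @POST
    @Consumes(MediaType.APPLICATION_JSON)
    public Response receive(TaskResult result) {
        // Idempotent: a repeated callback for the same task changes nothing.
        String previous = completedTasks.putIfAbsent(result.taskId, result.status);
        if (previous != null) {
            return Response.ok("duplicate callback ignored").build();
        }

        if ("SUCCESS".equals(result.status)) {
            archiveResult(result);        // hand off to archiving / downstream processing
        } else {
            scheduleRetryOrLog(result);   // apply the retry or logging policy for failures
        }
        return Response.ok("accepted").build();
    }

    private void archiveResult(TaskResult r) { /* persist and mark the task complete */ }

    private void scheduleRetryOrLog(TaskResult r) { /* requeue the task or record the failure */ }
}
```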

Conclusion#

The proposed layered distributed architecture incorporates edge nodes, offloads IO tasks, and uses a short-connection mechanism to improve resource allocation and reduce resource contention. It increases system efficiency, scalability, and flexibility, allowing the platform to handle growing competition demands. It also optimizes the use of memory, computation, and IO resources, enabling better real-time performance while providing the foundation for future system growth and expansion.

Author: Moritz Arena
Published at: 2024-03-07