Here is another challenge from the Bime project, which I talked about in the previous entry, “routing datasource”. We have two systems, “Bime Controller” and “Bime”. “Bime Controller” is the main web application, where clients (research labs) register to create their own sub-domain Bime site; “Bime” is the web application that manages the lab activities. Every Bime instance can serve multiple clients (say, 20), each of which has its own MySQL database. We use the Spring routing data-source to route the persistence layer to the correct data-source for each client.
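As a reminder of the routing idea, here is a minimal plain-Java sketch. It is not the project's code and does not use Spring; in Spring the same decision is usually made by subclassing AbstractRoutingDataSource and overriding determineCurrentLookupKey(). The class names ClientContext and ClientDatabaseRouter, the client keys, and the JDBC URLs are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Holds the current client's lookup key for one request thread (hypothetical helper). */
final class ClientContext {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    private ClientContext() {}

    static void set(String clientKey) { CURRENT.set(clientKey); }
    static String get() { return CURRENT.get(); }
    static void clear() { CURRENT.remove(); }
}

/** Plain-Java stand-in for the routing idea: choose a per-client JDBC URL by lookup key. */
final class ClientDatabaseRouter {
    private final Map<String, String> jdbcUrlByClient = new ConcurrentHashMap<>();

    /** Registers the database that belongs to one client (lab). */
    void register(String clientKey, String jdbcUrl) {
        jdbcUrlByClient.put(clientKey, jdbcUrl);
    }

    /** Returns the JDBC URL for the client bound to the current thread. */
    String currentJdbcUrl() {
        String key = ClientContext.get();
        String url = (key == null) ? null : jdbcUrlByClient.get(key);
        if (url == null) {
            throw new IllegalStateException("No database registered for client: " + key);
        }
        return url;
    }
}
```

A request filter would typically call ClientContext.set(...) with the key derived from the sub-domain before the persistence layer runs, and ClientContext.clear() afterwards.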
The challenge is that we need to scale the system horizontally: for every new client registered on Bime-Controller, we need to either find a Bime instance that has spare capacity or create a new EC2 instance for the new client. In essence, it is a task management system running in a cloud environment. There are open source and commercial frameworks that help with scaling in the cloud, but I want to have my own system, one that can deal with issues specific to my projects.
My approach is:
1, On the Bime-Controller side, there is an internal queue, the “Task Queue”, which holds all incoming task requests (for example, a request to host a newly registered client).
2, Also on the Bime-Controller side, each “worker instance” (for example, a Bime EC2 instance) has a small “engagement queue” that holds the tasks assigned to that worker. This queue has a defined maximum capacity, for example 20.
3, Bime-Controller has an RPC service for talking to Bime instances. For example, when a Bime instance starts up, it reports two things to Bime-Controller: first, “who am I”; second, all the clients it is already configured to serve (lab1, lab2, lab3….). Bime-Controller then creates the engagement queue for it. When a client cancels its registration, Bime also sends an update event to the Controller.
4, Each Bime instance has an RPC service too, through which the Controller sends its commands, for example to prepare for a new client (lab).
5, For every incoming request, Bime-Controller first consults all the engagement queues to see which one has spare capacity. If it finds one, it sends a please-prepare-for-client-lab24 message to that Bime instance (and returns without blocking); otherwise, it creates a new engagement queue, puts the request in the queue, and issues a command to create a new EC2 instance (passing the engagement queue name to the instance through User Data).
6, It takes a while for the new EC2 instance to start up. All requests that arrive during this period are put in the newly created engagement queue (if it is under capacity) or in another new engagement queue.
7, When the new Bime instance starts, it reports to the Controller with its name and clients. In this case, the Controller sees that it can take on more clients and passes it the requests in its engagement queue.
8, On the Bime instance, after the client lab is set up successfully, it reports to the Controller (and the Controller displays the result to the end user through Comet).
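The placement logic in steps 5 to 7 can be sketched in a few lines of Java. This is only an in-memory illustration under my own assumptions: the class EngagementScheduler, the "worker-N" queue names, and the "launch:" / "prepare:" command strings are hypothetical, and the real EC2 launch and RPC calls are replaced by entries in a list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the Controller's placement logic (steps 5 to 7). */
class EngagementScheduler {

    /** One engagement queue per worker instance. */
    static final class EngagementQueue {
        final String name;
        final List<String> assigned = new ArrayList<>(); // every client counted against capacity
        final List<String> pending = new ArrayList<>();  // assigned, but not yet sent to the worker
        boolean workerOnline = false;

        EngagementQueue(String name) { this.name = name; }
    }

    private final int capacity;
    private final Map<String, EngagementQueue> queues = new LinkedHashMap<>();
    private int nextWorker = 1;

    /** Stand-in for side effects: EC2 launches and RPC messages are only recorded here. */
    final List<String> commands = new ArrayList<>();

    EngagementScheduler(int capacity) { this.capacity = capacity; }

    /** Step 5: place a new client and return the name of the queue that took it. */
    String addNewTask(String client) {
        for (EngagementQueue q : queues.values()) {
            if (q.assigned.size() < capacity) {
                q.assigned.add(client);
                if (q.workerOnline) {
                    commands.add("prepare:" + q.name + ":" + client);
                } else {
                    q.pending.add(client); // step 6: the worker is still booting
                }
                return q.name;
            }
        }
        // No spare capacity anywhere: open a new queue and ask for a new instance.
        EngagementQueue q = new EngagementQueue("worker-" + nextWorker++);
        q.assigned.add(client);
        q.pending.add(client);
        queues.put(q.name, q);
        commands.add("launch:" + q.name);
        return q.name;
    }

    /** Steps 3 and 7: a worker reports in; flush whatever was waiting for it. */
    List<String> workerStarted(String name, List<String> existingClients) {
        EngagementQueue q = queues.computeIfAbsent(name, EngagementQueue::new);
        q.workerOnline = true;
        for (String c : existingClients) {
            if (!q.assigned.contains(c)) {
                q.assigned.add(c);
            }
        }
        List<String> flushed = new ArrayList<>(q.pending);
        q.pending.clear();
        for (String c : flushed) {
            commands.add("prepare:" + name + ":" + c);
        }
        return flushed;
    }
}
```

With a capacity of 2, adding lab1, lab2 and lab3 before any worker is up yields two launch commands; when worker-1 reports in, lab1 and lab2 are flushed to it as prepare commands. A real Controller would also need locking around these calls, which the sketch leaves out.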
In general, the communications are:
From Bime to Controller: reports its status at start-up, after a new client is set up, and after a client is removed.
From Controller to Bime: prepare a new client; remove an existing client; other inquiries.
[ Bime Controller Web App ] [Bime Web App]
[Controller Task RPC] [Worker Task RPC]
The RPC service on the Controller side provides the calls below:
1, Response addNewTask(1: Task task) - Called by the Bime-Controller web app to add a new client
2, Response removeTask(1: Task task) - Called by the Bime-Controller web app to remove a task or client
3, Response updateWorkerStatus(1: Worker worker) - Called by the Bime RPC to update the Bime worker status
4, Response updateTaskStatus(1: Worker worker, 2: Task task) - Called by the Bime RPC to update the task status
The RPC service on the Worker side provides the calls below:
1, Response addNewTask(1: Task task) - Called by the Controller RPC to add a new client
2, Response removeTask(1: Task task) - Called by the Controller RPC to remove an existing client
3, Response updateTaskStatus(1: Task task) - Called by the Bime web app to respond to the new client request
4, Response updateWorkerStatus(1: Worker worker) - Called by the Bime web app to report its status
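To make the two services concrete, here is a rough Java rendering of both sets of calls, wired together in-process. The method names and parameters come from the lists above; everything else is an assumption for illustration: the fields of Task, Worker and Response, the status strings, and the LoopbackWorker and RecordingController classes. A real deployment would put a network RPC layer between the two sides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical value types: the fields are guesses for the sake of the example.
final class Task {
    final String clientName;
    String status = "NEW";
    Task(String clientName) { this.clientName = clientName; }
}

final class Worker {
    final String name;
    final List<String> clients = new ArrayList<>();
    Worker(String name) { this.name = name; }
}

final class Response {
    final boolean ok;
    final String message;
    Response(boolean ok, String message) { this.ok = ok; this.message = message; }
}

/** Calls exposed on the Controller side. */
interface ControllerTaskRpc {
    Response addNewTask(Task task);
    Response removeTask(Task task);
    Response updateWorkerStatus(Worker worker);
    Response updateTaskStatus(Worker worker, Task task);
}

/** Calls exposed on the Worker (Bime) side. */
interface WorkerTaskRpc {
    Response addNewTask(Task task);
    Response removeTask(Task task);
    Response updateTaskStatus(Task task);
    Response updateWorkerStatus(Worker worker);
}

/** In-process stand-in for a Bime worker: it "provisions" instantly and reports back. */
final class LoopbackWorker implements WorkerTaskRpc {
    private final Worker self;
    private final ControllerTaskRpc controller;

    LoopbackWorker(String name, ControllerTaskRpc controller) {
        this.self = new Worker(name);
        this.controller = controller;
    }

    public Response addNewTask(Task task) {
        self.clients.add(task.clientName);
        task.status = "READY";
        return updateTaskStatus(task);
    }

    public Response removeTask(Task task) {
        self.clients.remove(task.clientName);
        task.status = "REMOVED";
        return updateTaskStatus(task);
    }

    public Response updateTaskStatus(Task task) {
        return controller.updateTaskStatus(self, task);
    }

    public Response updateWorkerStatus(Worker worker) {
        return controller.updateWorkerStatus(worker);
    }
}

/** Controller stub with a single worker; it records events instead of updating a UI. */
final class RecordingController implements ControllerTaskRpc {
    final List<String> events = new ArrayList<>();
    WorkerTaskRpc worker; // set after construction, since each side needs the other

    public Response addNewTask(Task task) {
        events.add("queued " + task.clientName);
        return worker.addNewTask(task);
    }

    public Response removeTask(Task task) {
        return worker.removeTask(task);
    }

    public Response updateWorkerStatus(Worker w) {
        events.add(w.name + " serves " + w.clients);
        return new Response(true, "ok");
    }

    public Response updateTaskStatus(Worker w, Task t) {
        events.add(t.clientName + " " + t.status + " on " + w.name);
        return new Response(true, "ok");
    }
}
```

Calling addNewTask on the RecordingController with a Task for lab24 records "queued lab24" and then "lab24 READY on bime-1" if the worker was created with the name bime-1, which mirrors the round trip Controller to Bime and back described earlier.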