Drahan
GM
Hello Milletians!
You may have noticed that the server has recently had a steep decline in stability.
We take server stability very seriously, and it's a real shame that we aren't able to fulfill our goals for stability.
We know these crashes are extremely frustrating for our players, especially those who lose items during the crashes. Trust me, they are extremely frustrating for us too. It seems like every time I look away for any extended period of time, something has gone wrong.
Today, I am going to write about each individual crash, why it happened, and how the team is planning to solve these problems in the future.
Server crash explanations
In the past 2 weeks, the server has crashed 5 times. With 10 working days in that period, that works out to roughly a 50% chance that the server will crash on any given working day.
A 50% crash rate is certainly not something we are proud of.
Here is a list of the last 5 crashes, and their causes.
1: Unexpected server maintenance (to patch Spectre and Meltdown). Our server host did not notify us in advance about this.
2: Database server stalled due to resource exhaustion.
3: Login server stalled due to resource exhaustion.
4: Database server stalled due to resource exhaustion.
5: Unexpected Windows OS level error, probably caused by resource exhaustion but not verified.
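The 50% figure and the breakdown of causes in the list above can be checked with a few lines of Python. This is only a rough sketch: the cause labels are shortened paraphrases, it assumes 10 working days in the 2-week window, and it counts crash 5 as resource exhaustion even though that cause was not verified.

```python
from collections import Counter

# Crash causes as listed above (labels are shortened paraphrases).
crashes = [
    "host maintenance",      # 1: unannounced Spectre/Meltdown patching
    "resource exhaustion",   # 2: database server stall
    "resource exhaustion",   # 3: login server stall
    "resource exhaustion",   # 4: database server stall
    "resource exhaustion",   # 5: Windows error; probable, NOT verified
]

WORKING_DAYS = 10  # assumption: 5 working days per week over 2 weeks

by_cause = Counter(crashes)
crashes_per_working_day = len(crashes) / WORKING_DAYS
exhaustion_share = by_cause["resource exhaustion"] / len(crashes)

print(by_cause)
print(f"crashes per working day: {crashes_per_working_day:.0%}")
print(f"share tied to resource exhaustion: {exhaustion_share:.0%}")
```

If crash 5 turns out to have a different cause, the resource exhaustion share drops to 3 out of 5.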
Pet lag (or rather, the sinister database stall)
You may have experienced database server stalling firsthand - it's what is known as "pet lag". This "pet lag" is much more sinister than it may look at first glance.
People call it pet lag because summoning a pet takes an extremely long time to finish. The cause is the database server failing to respond to queries in a timely manner.
This lag also affects logging in, saving bank info, changing channels, and basically anything that touches the database.
The game server has a very aggressive cache on player data, so you usually don't notice it while playing until you summon a pet.
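To illustrate why a cache can hide a stalled database until a pet is summoned, here is a toy sketch. It is not the actual Mabinogi server code: the class names, keys, and the 0.05 second delay are all made up for the example, and it only models reads.

```python
import time


class SlowDatabase:
    """Hypothetical stand-in for a stalled database: every query takes `delay` seconds."""

    def __init__(self, delay):
        self.delay = delay
        self.queries = 0
        self.rows = {"player:1": {"gold": 100}, "pet:7": {"name": "Fido"}}

    def fetch(self, key):
        self.queries += 1
        time.sleep(self.delay)  # simulate the database being slow to respond
        return self.rows[key]


class CachedGameServer:
    """Serves reads from memory and only goes to the database on a cache miss."""

    def __init__(self, db):
        self.db = db
        self.cache = {}

    def load(self, key):
        if key not in self.cache:
            self.cache[key] = self.db.fetch(key)
        return self.cache[key]


def timed(fn, *args):
    """Return how long a single call takes, in seconds."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


db = SlowDatabase(delay=0.05)
server = CachedGameServer(db)
server.cache["player:1"] = {"gold": 100}  # assume player data was cached at login

cached_read = timed(server.load, "player:1")  # served from memory, no query
pet_summon = timed(server.load, "pet:7")      # cache miss, waits on the database
print(f"cached read: {cached_read:.4f}s, pet summon: {pet_summon:.4f}s")
```

In this model, normal play only touches cached data and stays fast, while the first uncached request has to wait on the slow database.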
If pet lag is present, the server is on a downward spiral towards an imminent full crash - where you will not be able to play at all.
Luckily, pet lag takes a long time to develop into a full crash, so you may notice we do not restart the server until it gets extremely bad (and usually right before the full crash).
Unfortunately, this causes players to have a less than desirable experience - but it's our only choice. If we immediately restart the server every time the database starts to stall, we would be restarting the server every few days. We choose the lesser of two evils.
In the event of a full database server crash, you will be unable to log in (it will stay on the login screen for a very long time and never finish).
Additionally, all unsaved player data is lost permanently and rolled back to a previous state.
Login server stalling
This is simply caused by extreme resource exhaustion. Mabinogi services do not recover well from failure, and under immense pressure, servers will occasionally lose connectivity with each other.
The moment the login server loses connectivity with the server coordinator or database server, it stalls and stays stalled until it is manually rebooted.
You can tell this has happened when you attempt to authenticate and you're immediately met with an "Unable to connect to server" error. The login server will not accept new connections when stalled.
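The stall behavior described above can be pictured as a very small state machine. This is a hypothetical model for illustration only, not how the real login server is implemented: the class, method names, and user names are invented, and the only behavior taken from the text is "lose a link, refuse connections, stay that way until a manual reboot".

```python
class LoginServer:
    """Toy model: once a dependency link drops, stay stalled until rebooted."""

    def __init__(self):
        self.stalled = False
        self.lost_peer = None

    def on_link_lost(self, peer):
        # No automatic reconnect is attempted in this model.
        self.stalled = True
        self.lost_peer = peer

    def accept(self, user):
        if self.stalled:
            # New connections are refused immediately while stalled.
            raise ConnectionRefusedError("Unable to connect to server")
        return f"{user} authenticated"

    def manual_reboot(self):
        self.stalled = False
        self.lost_peer = None


login = LoginServer()
before = login.accept("alice")

login.on_link_lost("database server")
try:
    login.accept("bob")
    refused = False
except ConnectionRefusedError:
    refused = True

login.manual_reboot()
after = login.accept("bob")
print(before, "|", "refused while stalled:", refused, "|", after)
```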
Windows errors
We're using Windows 2003 for the game server. Not much more to say here. The software is outdated and occasionally misbehaves, especially under high load.
We use this because the Mabinogi server we use (G13 specifically) was designed for Win 2003 - it isn't very stable in other versions of Windows. Just like the Mabinogi client isn't very compatible with Win8/Win10, the server's got the exact same problems (random freezes, etc)!
Again, we choose the lesser of two evils - unstable server software or unstable operating system. It just so happens that the operating system as a whole crashes less often than the unstable server software.
This particular problem, though rare, is actually relatively simple to solve (in theory) compared to the other problems: just use a more stable Windows version.
We are able to manually fix each crash we encounter on updated versions of Windows Server, which we have done in beta testing - but we're never fully confident in that work, never really sure that we haven't missed some small case in the hundreds of thousands of lines of assembly code that we have to work with to fix such problems.
A common theme
You may have noticed a common theme by now: resource exhaustion. 4 of our last 5 server crashes were from resource exhaustion (counting the unverified Windows error).
You're probably thinking, "Drahan, why exactly does MabiPro suffer from resource exhaustion in the first place?".
It's a good question - especially when you know how the Mabinogi server scales. It does not scale by player count or NPC count...it scales by NPC client count.
More players do not translate linearly into more resource usage.
Our actual resource usage, for the most part, is entirely static. It almost never changes.
By this logic, there is absolutely no reason that we should be suffering from resource exhaustion in the first place.
So why?
It all boils down to one thing: the way we host and finance the server.
MabiPro actually runs on an extremely low budget, and as such, we cut corners financially...a LOT.
The biggest corner we cut is our dedicated server hosting.
To cut dedicated server costs, we've got connections with a server hosting coalition that shall not be named, and we cut a deal with them.
Not only do we get an extreme discount for dedicated shares of their servers (much cheaper than what we can buy publicly), we are also allowed to use, free of charge, shares of the server that are not currently being used by paying customers.
We're using these unused server shares to power roughly 80-90% of our server.
(Those of you reading who are knowledgeable in the server hosting industry - no, we're not technically using a dedicated server, it's just easier to explain it that way)
What's the catch?
You see, nothing is really free in this world. We save a lot of money from this deal, but it comes at a cost: server stability.
The problem arises when a paying customer decides that they want to use the "unused" resources (read: the resources MabiPro is using) that they are entitled to.
These resources are ganked away from MabiPro instantly, and allocated to that paying customer until they're done using them.
Once they're done, we get those resources back.
(For those knowledgeable in the server hosting industry - yes, this is a form of overselling. It isn't good.)
However, Mabinogi does not respond kindly to its resources being forcefully taken away.
This process will instantly make MabiPro crash for one reason or another.
Progression of the problem...
The more paying customers the host gains, the more often our resources (that we aren't paying for) are forcefully taken away.
In recent weeks, this has grown to a point where it may happen every other day, simply due to the growth of the server host.
In other words: we need to solve this problem, and we need to solve it fast.
The Solution
Really, the solution to a large majority of our problems is so simple, any of you could figure it out.
We just need to rent a new dedicated server, really.
This is what we plan to do in the near future.
This will increase our operational cost by a significant amount, and for the most part we pay for the server out of pocket.
Fortunately, we have a good sum of Bitcoin donations (thanks to our generous community) that we have kept for a long while - we're going to use these funds to pay for the extra expenses incurred by renting a new server.
If we decide to go through with this plan, we will need continued support from the community in the long run.
We will try our best not to let you guys down, as long as you don't let us down.
Hopefully you enjoyed my large wall of text.
-GM Drahan