GitLab goes down after tired system admin deletes the wrong files


 

GitLab, a startup with $25 million in funding, is having a very bad day after a series of human errors caused the service to go down overnight.
 
GitLab provides a virtual workspace where programmers can work on their code together, merging individual projects into a cohesive whole. It’s a fast-growing alternative to the market leader, GitHub, the high-profile Silicon Valley startup valued at $2 billion.
 
And as of Wednesday morning, GitLab was only just starting to come back online. But even worse than the embarrassment of such major downtime, the company now has to warn a handful of its users that some of their data might be gone forever.

 


 

The bad day started on Tuesday evening, when a tired GitLab system administrator, working late at night in the Netherlands, tried to fix a slowdown on the site by clearing out the backup database and restarting the replication process. Unfortunately, the admin accidentally typed the command to delete the primary database instead, according to a blog entry.
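
The blog entry doesn’t spell out the exact commands involved, but the failure mode, a destructive cleanup intended for the secondary server landing on the primary instead, is a familiar one in operations. As a rough illustration only, with hypothetical hostnames, paths, and function names rather than anything from GitLab’s actual setup, a small guard like the Python sketch below refuses to wipe a data directory unless it is running on a host known to be a replica:

```python
import socket
import shutil
import sys

# Hypothetical example: these hostnames and paths are illustrative,
# not GitLab's real infrastructure.
REPLICA_HOSTS = {"db2.example.internal"}   # assumed replica hostname
DATA_DIR = "/var/lib/postgresql/data"      # assumed database data directory

def wipe_replica_data(data_dir: str = DATA_DIR) -> None:
    """Delete the replica's data directory so replication can be re-seeded."""
    host = socket.gethostname()
    if host not in REPLICA_HOSTS:
        # Running on the primary (or any unrecognized host) aborts
        # instead of deleting anything.
        sys.exit(f"Refusing to delete {data_dir}: {host} is not a known replica")
    shutil.rmtree(data_dir)
    print(f"Cleared {data_dir} on {host}; ready to restart replication")

if __name__ == "__main__":
    wipe_replica_data()
```

The point of a guard like this is simply that a destructive step asks the machine which host it is on, rather than trusting a tired human to have the right terminal window in focus.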
 
And by the time he noticed and scrambled to stop the deletion, it was too late. Of around 300 GB of files, only about 4.5 GB was left, according to the blog. The site had to be taken down for emergency maintenance while the company figured out what to do, keeping users apprised via its blog, Twitter, and a Google Doc that the GitLab team kept updated as new developments arose.
 


 
Making matters worse, the most recent backups were seemingly unrestorable. “Out of 5 backup/replication techniques deployed, none are working reliably or set up in the first place,” the blog said. “We ended up restoring a 6 hours old backup,” said Interim VP of Marketing Tim Anglade, which means that any data created in that six-hour window may be lost forever.
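
That admission, that none of the five backup or replication methods could actually be relied on, points at a broader lesson: backups that are never checked tend to fail silently. As an illustrative sketch only, with hypothetical paths and thresholds rather than GitLab’s real configuration, a periodic check like the Python snippet below flags a latest dump that is missing, stale, or implausibly small before an incident does:

```python
import sys
import time
from pathlib import Path

# Hypothetical settings; the article does not describe GitLab's backup layout.
BACKUP_DIR = Path("/var/backups/postgres")
MAX_AGE_HOURS = 24          # a dump older than this counts as stale
MIN_SIZE_BYTES = 1_000_000  # an implausibly small dump counts as broken

def check_latest_backup() -> str:
    """Return an OK/FAIL summary for the newest backup dump."""
    dumps = sorted(BACKUP_DIR.glob("*.dump"), key=lambda p: p.stat().st_mtime)
    if not dumps:
        return "FAIL: no backup files found"
    latest = dumps[-1]
    age_hours = (time.time() - latest.stat().st_mtime) / 3600
    if age_hours > MAX_AGE_HOURS:
        return f"FAIL: newest backup {latest.name} is {age_hours:.1f} hours old"
    if latest.stat().st_size < MIN_SIZE_BYTES:
        return f"FAIL: newest backup {latest.name} is suspiciously small"
    return f"OK: {latest.name} ({age_hours:.1f} hours old)"

if __name__ == "__main__":
    result = check_latest_backup()
    print(result)
    sys.exit(0 if result.startswith("OK") else 1)
```

Run from cron or a monitoring agent, a check like this turns a quietly broken backup job into an alert within a day, rather than a surprise discovered mid-restore.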
 
While the team was restoring that older version of the database, the site was completely down for at least six hours, Anglade says. Worse, intermittent failures as they brought the service back online took several more hours, with everything only starting to return to normal on Wednesday morning.

 

[Photo: the entire GitLab team, as of September 2016]

 

The good news, says Anglade, is that the database that was affected didn’t actually contain anyone’s code, just things like comments and bug reports. Furthermore, Anglade says that the many customers who installed GitLab’s software on their own servers weren’t affected, since those installations don’t connect to GitLab.com. And paying customers weren’t affected at all, the company said, which minimizes the financial impact.
 
The outage is bad, as is the looming possibility that some of that data might be gone, Anglade acknowledges, but nobody is going to have to start rewriting their software from scratch, and only around 1% of GitLab’s users will see any lasting effects from this incident.
 
As for the systems administrator who made the mistake, Anglade is hesitant to place blame, since it was really the whole team’s fault that none of their other backup systems were working. “It’s fair to say it’s more than one employee making a mistake,” he says.

 
 
