This topic is for status updates on the server that hosts the PWC website, forums and email.
There will be some downtime later this week (or possibly early next) in order to replace one of the disks on the server. More accurate time and duration estimate when I have the disk and a plan.
Apologies for the overnight downtime, if any of our insomniacs were suffering withdrawal symptoms.
Apparently it was scheduled BT downtime of bits of the Peterborough exchange. Sufficiently scheduled that they didn't actually bother to tell anyone.
Disc swap should be happening this weekend - watch this space for downtime.
As it happens, it went down while I was in the middle of posting something. Not amused. >:(
Quote from: Rob Farley on February 28, 2012, 09:47:07 am
As it happens, it went down while I was in the middle of posting something. Not amused. >:(
Not half as unamused as the local friend who manages the ADSL for various P'boro businesses and got a 1am phone call as a result, I suspect. He was markedly dischuffed this morning.
I have made a slight change to the config of the line the server is hosted on: basically, my ISP occasionally rolls out new code on one of a pair of machines, and then switches lines over to the new code in the small hours of the morning (to minimise disruption if the new code isn't up to snuff). The switch is done automatically, and I get to choose the time. Since evidently some people *do* post at 1 or 2 am, I've set the line to switch at 4am, so in the unlikely event of there being a switchover and it failing for a few minutes, no-one sane (hah) should be affected.
Oh, also? We're on the 80Mb/20Mb fibre test at present, so downloads should be *blindingly* fast.
Advance warning - the server that hosts the forums will be going down briefly while alterations are made to our fusebox - this should be tomorrow morning sometime.
If I get time, I'll try and swap the drive at the same time.
The site and server will be down for about 5 mins at about 5.30pm today (Thurs 5th), for security updates.
Apologies for the half an hour downtime earlier this afternoon - this was due to a BT routing problem outside my control.
Ok, now THAT one was entirely my fault, due to cocking up an upgrade that was supposed to result in about 2 mins or less downtime.
Expect another (hopefully brief) period of downtime early this (Thursday) evening while I finish off the job.
More downtime, and hopefully the last.
The server that hosts the forums/website will be down for a chunk of Friday early evening while I (finally) replace a hard drive and upgrade the kernel.
Apologies for this - hopefully this will be the last time for a good long while.
There will be some brief server downtime over this coming weekend (22nd/23rd) for security updates to the OS. I'll try and keep it to a minimum and not do it on an evening - will keep you posted as to precisely when.
Quote
There will be some brief server downtime over this coming weekend (22nd/23rd) for security updates to the OS. I'll try and keep it to a minimum and not do it on an evening - will keep you posted as to precisely when.
Not to worry, I'm sure we'll all be too busy shopping for/hiding/wrapping gifts this weekend what we should have done already!
It looks like my UPS is sitting downstairs here at Goods In, so there will be some downtime late this evening to redo the spaghetti around the new server and hook it up to the UPS.
(http://highport.altrion.org/IMG_2088.JPG)
Did anyone else have problems getting on here yesterday afternoon, about 2pm onward?
Yes. Everyone. The server was in bits on my kitchen table.
Server was down from Monday evening (while we were at the club) until almost 4PM Wednesday when the new machine was brought online. Mike sent out various emails covering this, but for the benefit of anyone who didn't get them this was a terminal hardware failure and we are now on the shiny new box pictured above.
Given the severity of the problem I think Mike deserves a round of applause for getting us back up and running so quickly.
Quote from: Rob Farley on January 17, 2013, 03:47:55 pm
Given the severity of the problem I think Mike deserves a round of applause for getting us back up and running so quickly,
*blush*
To avoid taking undue credit - all that died was the UPS and the power supply. Unfortunately, the power supply for the old case was an integral part of the case (and also perhaps a little underspecified for the purpose), so I couldn't just swap it for a new one. All I had to do was buy a new (and roomier!) case and PSU, and swap the motherboard and disk drives across, which took me a whole 45 mins.
Given the standard set by our previous server admin, you can understand why I might be getting a little overenthusiastic about basic competence. ;D
Mike - 2 hours to notice the issue and 2 days to fix.
Old admin - 2 days without noticing and 2 weeks to fully fix.
Yes. Round of applause deserved.
Quote from: Daniel Phillips on January 17, 2013, 05:01:35 pm
Mike - 2 hours to notice the issue and 2 days to fix.
Old admin - 2 days without noticing and ignoring all emails telling him about it and 2 weeks to fully fix.
Updated for accuracy.
Quote from: Daniel Phillips on January 17, 2013, 05:01:35 pm
Mike - 2 hours to notice the issue and 2 days to fix.
Old admin - 2 days without noticing and 2 weeks to fully fix.
Well, if you're going to be that pedantic:
Mike: 2 minutes after walking in the door to notice, 40 hours to fix.
Given the new UPS, I would have noticed while still at the club, as it will SMS me on loss of power :D
Quote
Given the severity of the problem I think Mike deserves a round of applause for getting us back up and running so quickly,
Indeed, couldn't agree more. I only mentioned the outage as I thought it was just me, didn't connect it with the new thingy or even consider that the whole shebang might have gone belly up for all - which I suppose is a testament to recent impeccability of service.
As you may have noticed, we went down - we're now back with much tidier cables and a UPS :D
I would flag that our major remaining issue is that I don't have hot spares for the boot or data disks at present, or a backup power supply, which could result in downtime of circa '10 mins after PC World opens + however long it takes to restore' for the first and last of those, and 'running with no disk mirroring until I can buy a second drive' for the latter. As I personally think that's unacceptable (especially since I'm taking over hosting our church website next month) I'll be doing something about that :D
Rob: does SMF come with backup scripts?
Apologies for the short notice, but as I find myself with time on my hands, I will be upgrading the server to Ubuntu 12.10 over the next hour or so. Expect some downtime.
P.S. Ran a UPS test yesterday - worked a treat :D
P.P.S. slight delay as I wanted to take a full backup first :D
All done.
There will be (yeah, yeah, I know) a longer period of downtime sometime in the next week to replace the boot disk with an SSD, adjust a bit of the partitioning scheme, and drop in an extra SATA card. I'll keep you posted.
The site will be up and down several times today (Sat 26) while I do the aforementioned.
I'll be doing a reboot late this evening (15th) for some security updates, and to repurpose the old boot disk as a logs drive. Down time should be circa 5 mins.
Apologies for the downtime this am (Thursday 27 July) - the machine decided to eat all its free memory and go into terminal decline.
I nipped home to reboot it - now fixed. I may be rebooting it again this or next weekend for an OS update and possibly some more RAM.
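For anyone wanting earlier warning of this sort of thing on their own box, here is a minimal sketch (not what runs on this server) that reads the kernel's memory summary and flags when little is left. It assumes a Linux `/proc/meminfo` that includes the `MemAvailable` field; the function names and the 10% threshold are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value}, values in kB."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def available_fraction(info):
    """Fraction of total memory the kernel estimates is still available."""
    return info["MemAvailable"] / info["MemTotal"]


def low_memory(info, threshold=0.10):
    """True when available memory drops below the (hypothetical) threshold."""
    return available_fraction(info) < threshold
```

On a Linux machine this could be fed with `parse_meminfo(open("/proc/meminfo").read())` from a cron job, with the alerting left to whatever you already use.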
Try Aricept.
Just as a heads up - the server's boot SSD died this lunchtime - currently running on an out of date backup drive which will be updated from the Amazon S3 backup shortly. May be a little more downtime to do that later this evening.
Your grumpy sysadmin.
Due to a slight cockup on my part, I've had to restore from a backup taken yesterday at 3pm. Any forum posts since then will have vanished.
Thanks
Thanks
Apologies for server downtime this (Sunday) afternoon. We appeared to pick up a particularly nasty flavour of DDoS attack from some bunch of script kiddies with no brains and more time, so I've spent a little while upgrading and reconfiguring our firewall to make that harder for the next one.
The new firewall rules appear to be quite effective - it even blocked an online portscanning tool for the crime of... portscanning the network :D
The forum and website will be down for a bit later this evening as some nice person has just handed me 4GB of RAM to put in it :D
Whoops. Apologies for the forum downtime just now. I installed the new RAM, forgot to restart the webserver process and headed off to church.
Terribly sorry and all that :D
There may have been a brief glitch around 10:30am today due to the server's home partition becoming... how shall we put this, fuller than I'd like.
Fixed short term, but I may schedule some downtime this weekend to do some partition management.
The server will be down for a portion of this afternoon/evening, as I need to update it to patch the Shellshock vulnerability.
The patch is a touch more complex (due to the current version of the server software being unsupported), so I shall be doing this this afternoon.
Well. That WAS exciting. Not.
Short summary:
Server broke. Server fixed. Mike awesome.
Longer summary:
Began to notice lots and really ohfuckthisisnotgood lots of disk errors on the server around midnight last night. All three RAID partitions were marked as one disk of the pair failed, but rather alarmingly, it wasn't the same disk in all cases.
Rebooted the server, since it was in a confused state as a result and wasn't letting me fix the array.
No RAID disks visible on the SATA bus at all - BIOS times out trying to find them.
Faffed around till about 2am, gave up and slept on it. Starting to suspect the SATA card OR the motherboard was dead.
Woke up with an 'aha'. Removed the drive I thought was suspect, swapped in the spare I keep. Presto, BIOS finds all the drives, all the RAID volumes happily claim they have one good partition but could they have another please? Reformat the spare drive, add the appropriate partitions to the appropriate volumes, and leave it to rebuild. It appears that the dud drive was so dead it had been taking the entire SATA bus down with it.
Restart the web server process, and here we are.
Lessons:
Always, always, expect your shit to fail.
Leaving the email address in mdadm.conf as local when /var/mail is failing means guess what? you don't get emailed when the RAID fails :D
Note to self:
We are now running in 'no spare disks' mode. Go buy some more hard drives.
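The mdadm.conf lesson above can be checked mechanically. The following is a minimal sketch, not the check used on this server: it pulls the address off any `MAILADDR` line in an mdadm.conf-style text and flags addresses that look like local delivery. The function names and the notion of what counts as "local" (no domain, or a localhost domain) are assumptions for illustration.

```python
def mdadm_mail_targets(conf_text):
    """Return the addresses found on MAILADDR lines of mdadm.conf-style text."""
    targets = []
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        parts = line.split()
        # Simplification: only the full keyword is recognised here.
        if parts and parts[0].upper() == "MAILADDR":
            targets.extend(parts[1:])
    return targets


def is_local_only(address):
    """Assumption: no domain, or a localhost domain, means local delivery."""
    if "@" not in address:
        return True
    domain = address.rsplit("@", 1)[1].lower()
    return domain in ("localhost", "localhost.localdomain")


def risky_targets(conf_text):
    """Addresses that would not reach you if the local mail spool is broken."""
    return [a for a in mdadm_mail_targets(conf_text) if is_local_only(a)]
```

Run against the contents of your own mdadm.conf, a non-empty result from `risky_targets` suggests RAID failure mail would land on the very machine that is failing.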
Server will be going down for about half an hour around about 4.30pm UK time (11.30am Eastern) today (Friday), as there are two shiny new hard drives and a smaller, more energy efficient power supply sitting on my desk waiting to be installed.
There will also be a longer period of downtime, probably around 9am UK/4am Eastern, on Saturday, to install a new boot SSD and security updates to the operating system. Watch this space for more.
Apologies for brief site and forum downtime around 10pm Tuesday 4th - it appears the cable between my router and my BT modem has been on the way out for quite a while, and attempting to troubleshoot why we were getting rubbish download speeds finally killed it.
Fixed now.
You'd think I'd learned from this last time.
Unfortunately, the forum wasn't terribly readable last night due to a very similar problem. On the good side, when your email starts receiving:
Quote
This is an automatically generated mail message from mdadm
running on highport.altrion.org
A Fail event had been detected on md device /dev/md0.
It could be related to component device /dev/sde1.
Faithfully yours, etc.
... you do get a bit more warning than 'oh shit my disk is failing'
Now fixed (same problem as last time - unhappy drive making SATA bus unhappy). MORE spare drives bought.
Quote from: Mike Whitaker on October 01, 2014, 09:56:39 am
Well. That WAS exciting. Not.
Short summary:
Server broke. Server fixed. Mike awesome.
Longer summary:
Began to notice lots and really ohfuckthisisnotgood lots of disk errors on the server around midnight last night. All three RAID partitions were marked as one disk of the pair failed, but rather alarmingly, it wasn't the same disk in all cases.
Rebooted the server, since it was in a confused state as a result and wasn't letting me fix the array.
No RAID disks visible on the SATA bus at all - BIOS times out trying to find them.
Faffed around till about 2am, gave up and slept on it. Starting to suspect the SATA card OR the motherboard was dead.
Woke up with an 'aha'. Removed the drive I thought was suspect, swapped in the spare I keep. Presto, BIOS finds all the drives, all the RAID volumes happily claim they have one good partition but could they have another please? Reformat the spare drive, add the appropriate partitions to the appropriate volumes, and leave it to rebuild. It appears that the dud drive was so dead it had been taking the entire SATA bus down with it.
Restart the web server process, and here we are.
Lessons:
Always, always, expect your shit to fail.
Leaving the email address in mdadm.conf as local when /var/mail is failing means guess what? you don't get emailed when the RAID fails :D
Note to self:
We are now running in 'no spare disks' mode. Go buy some more hard drives.
But two years between failures is quite good.
One of the drive pair failed Friday night (obviously while I was away, because why the hell wouldn't it?)
Rebooted, swapped in the spare from the two on my bookcase, now rebuilding. Everything should be fine.
Advance warning - there will be some weekend downtime sometime in the next couple of weeks, as I have just bought the 5-year old server's replacement - I have a suspicion the case fan and the SATA bus may be on the way out, and it's past due for a replacement anyway.
Expect some downtime on Saturday evening (May 5th) while it all gets transferred over to the new box.
Also possibly expect some downtime Wednesday afternoon (May 2nd) as we're having a smart meter installed. Server and net connection are on the UPS, so should stay up if the power outage is brief enough.
Running a fairly large security update on the server - this will result in a bit of downtime over the next couple of hours.
phase 1 completed - there will be more downtime, as shutting down the web server seems to be the first thing it does as part of the update :D
Should be sorted now, will need one reboot but I want to do that with a screen on the server, not remotely, so that will be sometime tomorrow.
Apologies for downtime from about 22.05 on 22 March - local power cut. All now restored, except that I will need to take things down again sometime tomorrow to swap the UPS back in.
Quote from: Mike Whitaker on March 22, 2019, 11:15:22 pm
Apologies for downtime from about 22.05 on 22 March - local power cut. All now restored, except that I will need to take things down again sometime tomorrow to swap the UPS back in.
Yes I reported the power cut, covered quite a large area!
Failed HD on server - various services are down.
HEADS UP:
I will be shutting this server down in the next couple of hours to move the databases across. You *may* lose any new posts, replies and PMs from about 5pm onwards.
Ok. We're back, I think.
It appears the new server expects php7, so I had to do a wee bit of magic to get this to run under php5.6 :D
Any issues, please drop me a message.
Testing.
Ok, it turns out that the updated forum software is so cutting edge that SMF only have a single theme that's confirmed as compatible.
I've installed it, but I'll keep an eye out for other options.
Still need to reset the logo, not sure of the filepath on Mike's server.
Quote from: Rob Farley on May 21, 2019, 02:14:24 am
Ok, it turns out that the updated forum software is so cutting edge that SMF only have a single theme that's confirmed as compatible.
I've installed it, but I'll keep an eye out for other options.
Likewise - I'm assuming as soon as 2.1 stops being a RC loads of folks will upgrade their themes.
Those of you who, like me, hate light on dark can use the theme picker (click on your username, Forum Profile -> Look and Layout, and click on 'change' to the right of Current Theme) - Curve2 is the dark-on-light original default.
Quote
Still need to reset the logo, not sure of the filepath on Mike's server.
Root dir is (for historical reasons) /var/www/pwc.altrion.org/forums/
Have fixed smileys, I think.
Quote from: Mike Whitaker on May 21, 2019, 10:40:15 am
Those of you who like me hate light on dark, can use the theme picker (click on your username, Forum Profile -> Look and Layout, and click on 'change' to the right of Current Theme)- Curve2 is the dark on light original default.
Personally, I've always liked a light-on-dark theme. I use it on all my terminal sessions on Linux. It's easier on the eyes and takes me back to the old days with VT240 and Wyse terminals which were green or amber on black. Good Times!
Colin
Quote from: Mike Whitaker on May 21, 2019, 10:40:15 am
Those of you who like me hate light on dark, can use the theme picker (click on your username, Forum Profile -> Look and Layout, and click on 'change' to the right of Current Theme)- Curve2 is the dark on light original default.
I turned that option on specifically because I know light on dark is an issue for some people, either aesthetically or medically.
Quote from: Rob Farley on May 21, 2019, 12:23:23 pm
Quote from: Mike Whitaker on May 21, 2019, 10:40:15 am
Those of you who like me hate light on dark, can use the theme picker (click on your username, Forum Profile -> Look and Layout, and click on 'change' to the right of Current Theme)- Curve2 is the dark on light original default.
I turned that option on specifically because I know light on dark is an issue for some people, either aesthetically or medically.
Thank you for that, I actually couldn't even see some of the text on the default theme. Fortunately could read enough to work out how to change it!
Other than default colour scheme, all looks good to me. Messages were better as it auto saved sent message rather than me having to tell it to.
We've even got a specific smiley for Andy Miller :police:
And just in case there are any fellow luddites out there, I have just had to redo all my notification settings, which had all defaulted to me receiving no emails on anything.
Notifications tab is reached through your user profile. Mike or Rob will probably be able to provide better instruction if needed.
I will be taking the website and forum down for a few minutes in the next hour or so in order to tidy up a whole mess of cables under the desk.
And done. Helps if you remember which of the TWO server network ports is eno1 :D
There are now another couple of forum themes available for people to choose from. They're nothing particularly special but if people want a choice you've now got one.
Once again, this is every theme currently available to us on this version of the software. For a grand total of 4.
Apologies for the site downtime overnight yesterday/today. As far as I can see, something (I suspect a misconfigured jail in fail2ban) started grabbing open file handles until the system ran out :D
I have, I believe, fixed this so it won't happen again :D
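For tracking down which process is hoarding file handles, here is a minimal sketch, assuming a Linux-style `/proc` where each process has an `fd` directory. It is an illustration rather than the diagnosis actually used here; the function names are made up, and without root it will only see processes you are allowed to inspect.

```python
import os


def open_fd_counts(proc_root="/proc"):
    """Return {pid: number of open file descriptors} for readable processes."""
    counts = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            fds = os.listdir(os.path.join(proc_root, entry, "fd"))
        except OSError:
            continue  # process exited, or we are not permitted to look
        counts[int(entry)] = len(fds)
    return counts


def top_fd_users(counts, n=5):
    """The n processes holding the most descriptors, largest first."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Calling `top_fd_users(open_fd_counts())` a few minutes apart and watching one PID climb is usually enough to point the finger.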
Notice: there will be some downtime for the club website over the weekend while I try and convince the server to run PHP7.2 rather than PHP5.x.
Any swearing from the direction of Werrington can be mostly ignored, though if it goes on too long, please deliver chocolate and alcohol to the usual location :D
Apologies for the downtime over lunch on Monday 13th - this was due to upgrading the core routers (yes, my house has core routers and now a 1GB backbone) which took longer than the anticipated straight swap.
All should be back and healthy now. And I can move video files really fast! WHIZZZZ!
I think I've just fixed an issue with topic notifications - if you get alerts for posts to long-outstanding replies to topics you'd subscribed to, this is why :D
I shall be doing security updates this afternoon and early evening, so the site may be down briefly a couple of times.
Apologies if anyone was trying to post and got an error earlier.
The disk filled up. I have now MOVED the backups staging area onto a 900GB volume where that's not going to happen again :D
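A full disk is cheap to catch before it bites. The following is a minimal sketch using Python's standard `shutil.disk_usage`; the function names and the 90% threshold are assumptions for illustration, and the figure may differ slightly from what `df` reports because of reserved blocks.

```python
import shutil


def percent_used(path):
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # pseudo-filesystems can report a zero size
    return 100.0 * usage.used / usage.total


def volumes_over(paths, threshold=90.0):
    """Return the paths whose filesystem is at or above the threshold."""
    return [p for p in paths if percent_used(p) >= threshold]
```

Something like `volumes_over(["/", "/home"])` run from cron, with the mount points swapped for your own, would give a nudge before posts start failing.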
Apologies to the nocturnal amongst you for the overnight downtime - it appears that TalkTalk were doing scheduled maintenance on the Peterborough exchange which overran quite spectacularly.
Quote from: Mike Whitaker on October 05, 2021, 08:02:01 am
Apologies to the nocturnal amongst you for the overnight downtime - it appears that TalkTalk were doing scheduled maintenance on the Peterborough exchange which overran quite spectacularly.
I just assumed you had been influenced by Facebook!
Any idea when the site will go HTTPS, as my browser has started nagging and warning me!
Quote from: Carl Fisher on November 21, 2021, 07:21:25 pm
Any idea when the site will go HTTPS, as my browser has started nagging and warning me!
Good reminder - I'll see if I can sort it this week, as it's a relatively quick job.
Quote from: Mike Whitaker on November 22, 2021, 10:05:20 am
Quote from: Carl Fisher on November 21, 2021, 07:21:25 pm
Any idea when the site will go HTTPS, as my browser has started nagging and warning me!
Good reminder - I'll see if I can sort it this week, as it's a relatively quick job.
Famous last words!
Quote from: Carl Fisher on November 22, 2021, 11:33:07 am
Quote from: Mike Whitaker on November 22, 2021, 10:05:20 am
Good reminder - I'll see if I can sort it this week, as it's a relatively quick job.
Famous last words!
....should be.... :D
Done. Please let me know if you see any issues.
I have, I think, fixed things so the server sends email correctly again.
There will be some site downtime on Saturday while I attempt to resolve the versioning mess that exists around PHP versions higher than 7.0, the forums, and WordPress.
Assuming I get this working, everything should apparently be about 400% faster!
Quote from: Mike Whitaker on February 02, 2022, 03:52:07 pm
There will be some site downtime on Saturday while I attempt to resolve the versioning mess that exists around PHP versions higher than 7.0, the forums, and WordPress.
Assuming I get this working, everything should apparently be about 400% faster!
I got fed up, so we're now running php7.4 - apologies for the odd patches of downtime.
There may still be some site downtime and weirdness on Saturday as I'm going to be swapping to a new WordPress theme.
Apologies for the unavailability of the forums from the main site menu since yesterday, and (for a while) the whole site, this afternoon (Sun 8th May).
Tracked down to adding a new site to the server, and various config files not interacting in a polite manner. They have been Spoken To Severely.