Lisias Posted December 20, 2024

It's been about 2 or 3 days in which I barely got any 502 errors, and I got none in the last 24 hours. Can someone else confirm this? (I still get some on my monitoring tools, but even they are showing some improvement, with the "clean window" getting wider.)
Mr. Kerbin Posted December 20, 2024

2 hours ago, Lisias said:
It's been about 2 or 3 days in which I barely got any 502 errors, and I got none in the last 24 hours. Can someone else confirm this? (I still get some on my monitoring tools, but even they are showing some improvement, with the "clean window" getting wider.)

Generally, yes. Other people, it seems, not so much. 1 at the beginning of the day, 1 in the middle, and 1 at the end.

071612202024
Scarecrow71 Posted December 21, 2024

On 12/20/2024 at 4:22 AM, Lisias said:
It's been about 2 or 3 days in which I barely got any 502 errors, and I got none in the last 24 hours. Can someone else confirm this? (I still get some on my monitoring tools, but even they are showing some improvement, with the "clean window" getting wider.)

Well, I so wish this was true. Unfortunately, it took me 45 minutes to post in the "What did you do in KSP1 today" thread due to site issues.
Lisias Posted December 21, 2024 (edited)

17 hours ago, Scarecrow71 said:
Well, I so wish this was true. Unfortunately, it took me 45 minutes to post in the "What did you do in KSP1 today" thread due to site issues.

I got some 502s for a couple of minutes half an hour ago, and since then Forum has been working fine for me. Oh, krap. I don't even know if this is an improvement or not...

==== POST EDIT ====

Anyway... Here are my monitoring reports for the past 7 days. On Dec 18 and 19 I was updating the rig that hosts the monitoring, which is the reason for the flat line on one of the charts.

The "worst response time" chart shows a slight improvement compared to 7 days ago, but then I got a huge spike at 18:00Z today. More interesting yet, there was another spike at 20:00Z on Dec 15. Essentially a week ago - perhaps a pattern?

In the last two days, I see an apparent worsening in the 50x events - completely contradicting my claims of apparent improvement. However, the CF subnet I'm monitoring is different from the one I use at home.

Interestingly enough, I'm noticing that every day we have 3 hours without events, and that apparently they are 6 hours apart from each other - 1 good hour, 5 screwed ones, repeat. It's a huge improvement from Dec 15, no doubt, and apparently in line with Dec 16 and 17 - only the distribution over the hours changed.

I don't like coincidences. Good hours every 6 hours, terrible response times every 6 days... Looks like some automated processes ended up syncing while directly or indirectly causing the borkage?

Interestingly, it's the same M.O.: some 502, then some 503, then profit... With a few 504 now and then...

Spoiler

We once royally screwed up our infra with a kinda misconfigured watchdog - the thing was intended to detect when the main subnet gets botched, automatically switch traffic to the backup one, and then monitor the backup to switch again if the same happens.
Problem - we thought it would be a good idea to monitor 8.8.8.8 as our mine canary, because... Google, right? This thing never goes down. Well, apparently it did once - because one sunny day our watchdog could not reach 8.8.8.8, switched to the backup subnet, couldn't reach it from there either, and panicked into a loop of subnet switching while both subnets were working fine, which - unsurprisingly - played havoc with our QoS, triggering a tsunami of support tickets (and phone calls).

We spent the whole early morning tearing our hair out trying to figure out what the hell was happening, because obviously our manual tests were telling us everything was OK - until someone ran a test that (didn't) reach 8.8.8.8 and found it weird (hey, Google!!), which rang a bell in someone's head, who then shut the watchdog down, solving the problem.

Anyway... Interestingly enough, Forum looks good to me right now.

=== POST POST EDIT ===

I let this one slip through, my apologies.

There's another possible interpretation for a 503 Service Unavailable: it still means that there's no one left available to service the request, HOWEVER, it's also raised when you have workers around but they are all busy with something else, so there's no one left available to service you all the same. Having no one available to service you may mean either that there's no one around, or that the ones around are all busy and can't talk to you now.

So, given the regularity with which things are happening in my reports (possibly meaning quotas being exhausted), it may be just a too-small budget to pay for hosting Forum. They need to scale things up a bit, but the host is not going to do that for free - obviously.

On my DayJob© (and this also applies to our partners) a 503 means that we have shut down the services because... HELL, we never run our servers at full capacity, that would be suicidal.
We go that extra mile to guarantee that no request is left behind, because there's someone paying for that request, and dropping it would mean that someone paid for something that wasn't delivered. There's always about 20% of idle capacity on the servers, because we use this 20% as the signal that we are overloading and someone needs to scale things up (which always involves spending more money, sometimes lots of money).

But Forum works on a different paradigm - the people using it are not the ones paying for it, so we have a completely different dynamic, and the alternate meaning for a 503 makes sense.

The double posting I experienced (by reloading and resending the form after getting a 502) strongly suggests that the quota being exhausted is not on the BackEnd/Database, but instead something related to outbound traffic. AWS, for example, doesn't charge for incoming data, but charges for outgoing traffic. If whatever Forum is using as host does the same, we have (another) very good explanation for the symptoms I'm describing.

Given the (at least) temporary relief I'm experiencing, I think that someone decided to bite the bullet and put some more money into the server farm. But once Forum started to behave a bit better, more people came back to it (we had nearly 3K guests yesterday, where the average over the last months was between 1,200 and 1,800) and the problem started to happen again, for obvious reasons.

Edited December 22, 2024 by Lisias (POST POST EDIT)
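The "3 clean hours, 6 hours apart" observation in the post above could be checked against a monitoring log with something like the following. This is only a minimal sketch: the event format (a timestamp plus an HTTP status), the function names, and the sample data are all hypothetical, and nothing here reflects the actual monitoring tools used in the thread.

```python
from collections import Counter
from datetime import datetime, timezone

def errors_per_hour(events):
    """Count 502/503/504 events per UTC hour of day (0-23)."""
    counts = Counter({hour: 0 for hour in range(24)})
    for timestamp, status in events:
        if status in (502, 503, 504):
            counts[timestamp.astimezone(timezone.utc).hour] += 1
    return counts

def clean_hours(counts):
    """Hours of the day in which no 50x event was recorded."""
    return sorted(hour for hour, n in counts.items() if n == 0)

def gaps(hours):
    """Spacing between consecutive clean hours, wrapping around midnight."""
    return [(b - a) % 24 for a, b in zip(hours, hours[1:] + hours[:1])]

# Made-up sample: one 502 in every hour except 02:00, 08:00 and 14:00 UTC.
sample = [
    (datetime(2024, 12, 21, hour, 30, tzinfo=timezone.utc), 502)
    for hour in range(24)
    if hour not in (2, 8, 14)
]
# A successful request must not count as an error.
sample.append((datetime(2024, 12, 21, 2, 15, tzinfo=timezone.utc), 200))

counts = errors_per_hour(sample)
print(clean_hours(counts))        # [2, 8, 14]
print(gaps(clean_hours(counts)))  # [6, 6, 12]
```

Regular gaps like these would be consistent with a scheduled process or a periodically refreshed quota, but they would not prove either one.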
ColdJ Posted December 25, 2024 (edited)

On 12/17/2024 at 6:46 PM, SunlitZelkova said:
I’ve been getting so many bad gateway messages I didn’t realize the forum had been up at all since late October.

Oh I see. The forum server is a Gateway Computer from 2004.

On 12/17/2024 at 6:46 PM, SunlitZelkova said:
I’ve been getting so many bad gateway messages I didn’t realize the forum had been up at all since late October.

Oh I see. The forum server is a Gateway Computer from 2004.

On 12/17/2024 at 6:46 PM, SunlitZelkova said:
I’ve been getting so many bad gateway messages I didn’t realize the forum had been up at all since late October.

Oh I see. The forum server is a Gateway Computer from 2004.

@Lisias The triple posting is what I saw afterwards: when I tried to post this, I got a Bad Gateway 502 for 5 minutes.

Edited December 25, 2024 by ColdJ
kerbiloid Posted December 26, 2024 (edited)

Every time somebody tries to open the forum

Spoiler

The last forum software update from RobCo...

Try to use the Server #5 instead of Server #9. It has an outdated version of UOS, and never accepts the updates.

Upd.

Spoiler

No such error code in the reference. https://fallout.wiki/wiki/Unified_Operating_System#Termlink

Spoiler

Image | Code | Message attached | Purpose | Place where a terminal shows it
[image] | 0x03C663A1 | Network connection not found Dummy terminal inactive | The BIOS could not find the network link. | 1 in Rivet City's midship deck, common room; 1 in a Camp Golf Tent, Camp McCarran tent, Securitron Vault and Tops 13th floor (room Yes-Man)
[image] | 0x0AABFF00 | Primary power source unavailable Check all cords and plugs for connection | Power cord / power supply unit failure. | 2 in the Weather Monitoring Station; 2 in Allied Technologies Offices; 1 in the visitors center; 1 in Las Vegas Boulevard Station; 1 in the House Resort
[image] | 0x07F6BAAC | Bad data. Cannot read | A fatal error due to a memory / network / storage failure. | 1 in the Evergreen Mills foundry; 11 in Vault 3 living quarters
[image] | 0x00B636C6 | No input device found Connect a keyboard | The BIOS could not find the keyboard. | 1 in Fort Independence's lower level and 1 in Evergreen Mills foundry; 2 in Vault 3's living quarters
[image] | 0xFFFFF710 | Processor Corru;xsfkleg,,g364[735]3__. | Corrupt processor. | 1 in Fort Independence's main level; 1 in the Goodsprings Schoolhouse
[image] | 0xF141A013 | No Data Storage Detected. Check Tape Drive Connection. | No storage device present | 1 in the Megaton Clinic; 1 on Tops 13th floor (room Yes-Man)
[image] | 0x357C5001 | Bad Sectors Found In Boot Block. Terminal Error. | Bad sector | 1 in the Drainage Chamber near Broadcast Tower KT8
N/A | 0xFA770171 | Segmentation fault. | Improper authorization or corrupted file string | RobCo Sales & Service Center public display terminal
N/A | 0x0D890102 | Boot sector invalid/corrupt. | Corrupt or missing BIOS/MF Boot Agent | RobCo Sales & Service Center public display terminal
N/A | 0xFFF11011 | Memory fault. | Unreadable memory access | RobCo Sales & Service Center public display terminal
N/A | 0xC0001011 | Critical failure. | Unknown critical system failure | RobCo Sales & Service Center public display terminal
N/A | 0x00001001 | Critical failure. | Unknown critical system failure | RobCo Sales & Service Center public display terminal
N/A | 0xA001C007 | Invalid request. | Corrupted or non-privileged write attempt | Watoga Emergency Services complaint database terminal

But as the code 502 = 0x1F6 obviously belongs to the "External codes" section,

Spoiler

External error

Image | Code | Message attached | Purpose | Place where a terminal shows it
N/A | 0x081 | Data loss detected. | Lost data from a holotape | Chestnut Hillock Reservoir, Edwin's terminal
N/A | 0x104 | Manual controls are disabled while Nuka-Galaxy is in AUTOMATIC Ride Mode. | Manual controls were attempted while ride was in Automatic Mode | Nuka-Galaxy, control terminal
N/A | 0x107 | Catastrophic physical damage detected. | Extensive damage to the external device | Vault 94, G.E.C.K. monitoring station terminal
N/A | 0x109 | A core meltdown is in progress. | Complete or partial collapse of the nuclear core, such as with a Garden of Creation Kit | Vault 94, G.E.C.K. monitoring station terminal
N/A | 0x117 | Critical damage to mobility system. Unit is unable to move. | Severe damage to legs or propulsion jet | RobCo Battlezone, Robotics diagnostics terminal
N/A | 0x281 | Critical damage to targeting system. While in combat, Frenzy]unit will attack random targets. | Severe damage to Combat inhibitor | RobCo Battlezone, Robotics diagnostics terminal
N/A | 0x391 | Multiple malfunctions detected. The ride has been halted until safe conditions can be restored. | Malfunctions from several sources have triggered a safety lockout | Nuka-Galaxy, control terminal
N/A | 0x401 | Unknown | Unknown | Vault 94, G.E.C.K. monitoring station terminal
N/A | 0x402 | Unknown | Unknown | Vault 94, G.E.C.K. monitoring station terminal
N/A | 0x519 | Unknown | Unknown | Vault 94, G.E.C.K. monitoring station terminal
N/A | 0x509 | Unable to connect to the ABIS Biometrics ID Fabricator. | Disconnected or malfunctioning automated biometric identification system | Missile Silo terminal entries, biometrics system terminal

the problem is probably on the peripheral device side. Probably some gateway failure. Maybe molerats, maybe metal scrap hunters.

Edited December 26, 2024 by kerbiloid
Lisias Posted December 26, 2024

On 12/25/2024 at 2:35 AM, ColdJ said:
@Lisias The triple posting is what I saw afterwards: when I tried to post this, I got a Bad Gateway 502 for 5 minutes.

Yep. This is the proof that whatever is happening is not Forum itself but something where Forum "lives", i.e., the infra of the server farm.

This is what happens:

1. You post a new message.
2. Forum receives it, processes it, and sends the response with the new page.
3. But the response is lost, and someone in the middle sends you a 502.
4. Then you immediately post the message again.
5. Forum receives it, processes it (in this case, merging it with the last message you sent), and sends the response with the updated page.
6. But the response is lost, and someone in the middle sends you a 502 again.
7. Rinse, repeat.

If Forum itself were screwed (as really happened months ago), you would not be able to post anything - let alone merge previous posts. So Forum is healthy, and it has probably been healthy for months. The troublemaker is someone above Forum and below CloudFlare, which strongly suggests the server farm is somehow responsible for the problem.

From my experience with AWS, I know that it's usual to impose quotas on outbound traffic while making inbound traffic free. IF AND ONLY IF the server farm hosting Forum does the same, my best guess is, indeed, a too drastic cost reduction as I had proposed initially - not on the number of servers hosting Forum, but on the outbound traffic quota itself (which I thought should be cheaper than the servers, but - obviously - I was wrong on this detail).

2 hours ago, kerbiloid said:
Every time somebody tries to open the forum

Well... It's not too different from what I was doing initially, but then I built some logging tools and left this suffering behind!
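The sequence described in the post above can be sketched as a toy simulation. It assumes the hypothesis in the post is right (the backend handles every request and only the response gets lost on the way back). The `Forum` class, the merge rule, and the function names are hypothetical illustrations, not the real forum software.

```python
class Forum:
    """Toy backend that merges consecutive posts by the same author."""

    def __init__(self):
        self.posts = []  # list of (author, text) tuples

    def submit(self, author, text):
        if self.posts and self.posts[-1][0] == author:
            # Same author posting twice in a row: append to the previous post.
            _, previous_text = self.posts[-1]
            self.posts[-1] = (author, previous_text + "\n" + text)
        else:
            self.posts.append((author, text))
        return 200

def through_flaky_proxy(forum, author, text, drop_response):
    """The backend always processes the request; only the response may be lost."""
    status = forum.submit(author, text)
    return 502 if drop_response else status

def post_with_retries(forum, author, text, dropped_responses):
    """Resubmit the same form until a response finally gets through."""
    attempts = 0
    for drop in dropped_responses:
        attempts += 1
        if through_flaky_proxy(forum, author, text, drop) == 200:
            break
    return attempts

forum = Forum()
attempts = post_with_retries(forum, "ColdJ", "Oh I see.", [True, True, False])
print(attempts)        # 3
print(forum.posts[0])  # ('ColdJ', 'Oh I see.\nOh I see.\nOh I see.')
```

Two lost responses followed by one that arrives leave a single merged post containing the text three times, which matches the triple posting reported earlier in the thread.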
Fizzlebop Smith Posted December 26, 2024 (edited)

I live in Mountain Time, USA.

Between 7pm and 10pm I have a very, very hard time connecting to the forum. It will initially connect, but any link will produce the error. Refreshing the page a few times has varied success, but I get nothing after the initial load when trying to go somewhere deeper into the forum.

Other than that I have had few issues. The connectivity in general "seems" better, but I do have an almost complete forum blackout at certain times of night.

This would support the max outbound quota theory (in my opinion), depending on when the fiscal refresh occurs on that metric, my problem being near the end of MY day.

Edited December 26, 2024 by Fizzlebop Smith
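To line this window up with the UTC ("Z") timestamps used in the monitoring reports earlier in the thread, the local hours can be converted. This is a minimal sketch, assuming Mountain Standard Time (a fixed UTC-7 offset, since there is no daylight saving in December) and using Dec 26, 2024 purely as an illustrative date. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Mountain Standard Time in December, a fixed UTC-7 offset.
MST = timezone(timedelta(hours=-7), "MST")

def local_hour_to_utc(hour, year=2024, month=12, day=26):
    """Convert a whole local (MST) hour on an illustrative date to UTC."""
    local = datetime(year, month, day, hour, tzinfo=MST)
    return local.astimezone(timezone.utc)

start = local_hour_to_utc(19)  # 7pm local
end = local_hour_to_utc(22)    # 10pm local
print(start.strftime("%Y-%m-%d %H:%MZ"), "to", end.strftime("%Y-%m-%d %H:%MZ"))
# 2024-12-27 02:00Z to 2024-12-27 05:00Z
```

Under that assumption the reported blackout falls between 02:00Z and 05:00Z on the following UTC day.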