So, we had some kind of technical problem.


Vanamonde

It's been about 2 or 3 days in which I barely got any 502 errors, and I got none in the last 24 hours.

Can someone else confirm this?

(I still get some on my monitoring tools, but even they are showing some improvement, with the "clean window" getting wider)


2 hours ago, Lisias said:

It's been about 2 or 3 days in which I barely got any 502 errors, and I got none in the last 24 hours.

Can someone else confirm this?

(I still get some on my monitoring tools, but even they are showing some improvement, with the "clean window" getting wider)

Generally, yes.

For other people, it seems, not so much.

1 in the beginning of the day, 1 in the middle, and 1 at the end

071612202024


On 12/20/2024 at 4:22 AM, Lisias said:

It's been about 2 or 3 days in which I barely got any 502 errors, and I got none in the last 24 hours.

Can someone else confirm this?

(I still get some on my monitoring tools, but even they are showing some improvement, with the "clean window" getting wider)

Well, I so wish this was true.  Unfortunately, it took me 45 minutes to post in the "What did you do in KSP1 today" thread due to site issues.


17 hours ago, Scarecrow71 said:

Well, I so wish this was true.  Unfortunately, it took me 45 minutes to post in the "What did you do in KSP1 today" thread due to site issues.

I got some 502s for a couple of minutes half an hour ago, and since then the Forum has been working fine for me.

Oh, krap. I don't even know if this is an improvement or not... :/

==== POST EDIT ====

Anyway... Here are my monitoring reports for the past 7 days. On Dec 18 and 19 I was updating the rig that hosts the monitoring, which is the reason for the flat line on one of the charts.

The "worst response time" chart shows a slight improvement compared to 7 days ago, but then I got a huge spike at 18:00Z today. Even more interesting, there was another spike at 20:00Z on Dec 15. Essentially a week ago - perhaps a pattern?

20241221.Worst-Response-Time.png

In the last two days, I see an apparent worsening in the 50x events - completely contradicting my claims of apparent improvement. However, the CF subnet I'm monitoring is different from the one I use at home.

Interestingly enough, I'm noticing that every day we have 3 hours without events, and that apparently they are 6 hours apart from each other - 1 good hour, 5 screwed ones, repeat.

20241221.Events.png

It's a huge improvement from Dec 15, no doubt, and apparently in line with Dec 16 and 17 - only the distribution over the hours changed.

I don't like coincidences. Good hours every 6 hours, terrible response times every 6 days... Looks like some automated processes ended up syncing while directly or indirectly causing the borkage? Interestingly, it's the same M.O.: some 502, then some 503, then profit... With a few 504 now and then...
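For anyone wanting to check the "clean hour" pattern against their own logs, here is a minimal sketch in Python. It assumes you already have a list of timezone-aware timestamps of 50x events; the function names and the sample data are made up for illustration and are not the actual tooling behind the charts above.

```python
from collections import Counter
from datetime import datetime, timezone


def clean_hours(event_times):
    """Return the UTC hours of day (0-23) in which no event was logged."""
    counts = Counter(t.astimezone(timezone.utc).hour for t in event_times)
    return [hour for hour in range(24) if counts[hour] == 0]


def gaps_between(hours):
    """Hours between consecutive clean hours, wrapping around midnight."""
    if len(hours) < 2:
        return []
    following = hours[1:] + hours[:1]
    return [(b - a) % 24 for a, b in zip(hours, following)]


# Made-up sample: one 50x event in every hour except 03, 09 and 15 UTC.
sample = [
    datetime(2024, 12, 21, hour, 15, tzinfo=timezone.utc)
    for hour in range(24)
    if hour not in (3, 9, 15)
]

print(clean_hours(sample))                # clean hours found in the sample
print(gaps_between(clean_hours(sample)))  # spacing between them
```

Feeding it several days at once folds everything onto a single 24-hour clock, so an hour only counts as clean if it was clean on every day in the input.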
 

Spoiler

We once royally screwed up our infra with a kinda misconfigured watchdog - the thing was intended to detect when the main subnet gets botched, automatically switch traffic to the backup one, and then monitor the backup to switch again if the same happens.

Problem - we thought it would be a good idea to monitor 8.8.8.8 as our canary, because... Google, right? This thing never goes down.

Well, apparently it did once - because our watchdog could not reach 8.8.8.8 one sunny day, switched to the backup subnet, couldn't reach it there either, and panicked into a loop of subnet switching fest while both subnets were working fine, which - unsurprisingly - played havoc with our QoS, triggering a tsunami of support tickets (and phone ringing).

We spent the whole dawn tearing our hair out trying to figure out what the hell was happening, because obviously our manual tests were telling us everything was OK - until someone ran a test that (didn't) reach 8.8.8.8 and found it weird (hey, Google!!), ringing a bell in someone's head, who then shut the watchdog down, solving the problem.
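The failure mode above can be sketched as a small decision function. This is only an illustration under assumptions: it supposes the watchdog can probe several canaries through both subnets before acting (the original one probed a single canary and switched first), and the names `decide`, `healthy` and the extra canary addresses are hypothetical.

```python
def healthy(results: dict[str, bool]) -> bool:
    """A subnet looks healthy if a strict majority of canaries answered."""
    return sum(results.values()) * 2 > len(results)


def decide(active: dict[str, bool], backup: dict[str, bool]) -> str:
    """Return 'stay', 'switch' or 'alert' from canary reachability per subnet."""
    if healthy(active):
        return "stay"
    if healthy(backup):
        return "switch"
    # Neither subnet reaches the canaries: the canaries (or something
    # upstream of both subnets) are the likely culprit, so do not flap.
    return "alert"


# One unreachable canary, seen from both subnets: no switching loop.
print(decide({"8.8.8.8": False}, {"8.8.8.8": False}))

# Several canaries: a single one going dark does not trigger anything.
probes = {"8.8.8.8": False, "1.1.1.1": True, "9.9.9.9": True}
print(decide(probes, probes))
```

The point of the third outcome is that "canary unreachable from everywhere" says something about the canary, not about the subnet.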

Anyway... Interestingly enough, Forum looks good to me right now.

=== POST POST EDIT ===

I let this one slip by, my apologies. There's another possible interpretation for a 503 Service Unavailable: it still means that there's no one left available to service the request, HOWEVER, it's also raised when you have workers around but they are all busy with something else, so there's still no one available to service you. Having no one left available to service you may mean either that there's no one around, or that the ones around are all busy and can't talk to you now.
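The two readings of a 503 can be put side by side in a toy model. This is a sketch only: `status_for_request` and its parameters are invented for illustration and say nothing about how the Forum's actual stack decides.

```python
def status_for_request(total_workers: int, busy_workers: int) -> tuple[int, str]:
    """Return (HTTP status, reason) for a new request in a toy worker pool."""
    if total_workers == 0:
        return 503, "no workers running"
    if busy_workers >= total_workers:
        return 503, "all workers busy"
    return 200, "a worker is free"


print(status_for_request(0, 0))    # service shut down
print(status_for_request(8, 8))    # service up, but saturated
print(status_for_request(8, 5))    # service up, with spare capacity
```

From the client side the first two cases look identical, which is why the status code alone can't tell them apart.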

So, given the regularity with which things are happening in my reports (possibly meaning quotas being exhausted), it may be just a too-small budget to pay for hosting the Forum. They need to scale things up a bit, but the host is not going to do that for free - obviously.

At my DayJob© (and this also applies to our partners) a 503 means that we have shut the services down, because... HELL, we never run our servers at full capacity, that would be suicidal. We go the extra mile to guarantee that no request is left behind, because someone is paying for that request, and dropping it would mean that someone paid for something that wasn't delivered. There's always about 20% of idleness on the servers because we use this 20% as a trigger that we are overloading and someone needs to scale things up (which always involves footing more money, sometimes lots of money).
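The 20% idleness trigger described above can be sketched as a one-line check. The function name, the parameters and the sample numbers are assumptions made for the example; only the 20% figure comes from the post.

```python
def needs_scale_up(busy: int, capacity: int, headroom: float = 0.20) -> bool:
    """True once the idle share of capacity drops below the reserved headroom."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    idle_fraction = (capacity - busy) / capacity
    return idle_fraction < headroom


print(needs_scale_up(busy=70, capacity=100))  # 30% idle
print(needs_scale_up(busy=85, capacity=100))  # 15% idle
```

The alarm fires while requests are still being served, so scaling can happen before anyone sees a 503.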

But Forum works on a different paradigm - people using it are not the ones paying for it, and so we have a completely different dynamic, and the alternate meaning for a 503 makes sense.

The double posting I experienced (by reloading and resubmitting the form when getting a 502) strongly suggests that the quota being exhausted is not on the BackEnd/Database, but instead something related to outbound traffic. AWS, for example, doesn't charge for incoming data, but charges for outgoing traffic. If whatever Forum is using as host does the same, we have (another) very good explanation for the symptoms I'm describing.

Given the (at least) temporary relief I'm experiencing, I think that someone decided to bite the bullet and put some more money into the server farm. But once Forum started to behave a bit better, more people came back to it (we had nearly 3K guests yesterday, where the average over the last months was between 1,200 and 1,800) and the problem started to happen again for obvious reasons.

Edited by Lisias
POST POST EDIT

On 12/17/2024 at 6:46 PM, SunlitZelkova said:

I’ve been getting so many bad gateway messages I didn’t realize the forum had been up at all since late October.

Oh I see. The forum server is a Gateway Computer from 2004. :D

 

On 12/17/2024 at 6:46 PM, SunlitZelkova said:

I’ve been getting so many bad gateway messages I didn’t realize the forum had been up at all since late October.

Oh I see. The forum server is a Gateway Computer from 2004. :D

 

On 12/17/2024 at 6:46 PM, SunlitZelkova said:

I’ve been getting so many bad gateway messages I didn’t realize the forum had been up at all since late October.

Oh I see. The forum server is a Gateway Computer from 2004. :D

@Lisias The triple posting is what I saw after I tried to post this and got a Bad Gateway 502 for 5 minutes.

Edited by ColdJ

Every time somebody tries to open the forum

Spoiler

Fallout-Hacking.jpg

screenshot_04.png

The last forum software update from RobCo...

Try to use the Server #5 instead of Server #9.
It has an outdated version of UOS, and never accepts the updates.

 

Upd.

Spoiler

No such error code in the reference.

https://fallout.wiki/wiki/Unified_Operating_System#Termlink

Spoiler
Image | Code | Message attached | Purpose | Place where a terminal shows it
Network connection not found terminal error.jpg | 0x03C663A1 | Network connection not found; Dummy terminal inactive | The BIOS could not find the network link. | Fallout 3: 1 in Rivet City's midship deck, common room; Fallout: New Vegas: 1 in a Camp Golf Tent, Camp McCarran tent, Securitron Vault and Tops 13th floor (room Yes-Man)
Primary source unavailable terminal error.jpg | 0x0AABFF00 | Primary power source unavailable; Check all cords and plugs for connection | Power cord / power supply unit failure. | 2 in the Weather Monitoring Station; 2 in Allied Technologies Offices; 1 in the visitors center; 1 in Las Vegas Boulevard Station; 1 in the House Resort
Bad data cannot read terminal error.jpg | 0x07F6BAAC | Bad data. Cannot read | A fatal error due to a memory / network / storage failure. | Fallout 3: 1 in the Evergreen Mills foundry; Fallout: New Vegas: 11 in Vault 3 living quarters
No input device terminal error.jpg | 0x00B636C6 | No input device found; Connect a keyboard | The BIOS could not find the keyboard. | Fallout 3: 1 in Fort Independence's lower level and 1 in Evergreen Mills foundry; Fallout: New Vegas: 2 in Vault 3's living quarters
Processor corrupt terminal error.jpg | 0xFFFFF710 | Processor Corru;xsfkleg,,g364[735]3__. | Corrupt processor. | Fallout 3: 1 in Fort Independence's main level; Fallout: New Vegas: 1 in the Goodsprings Schoolhouse
No data storage terminal error.jpg | 0xF141A013 | No Data Storage Detected. Check Tape Drive Connection. | No storage device present | Fallout 3: 1 in the Megaton Clinic; Fallout: New Vegas: 1 on Tops 13th floor (room Yes-Man)
Bad sector boot block terminal error.jpg | 0x357C5001 | Bad Sectors Found In Boot Block. Terminal Error. | Bad sector | Fallout 3: 1 in the Drainage Chamber near Broadcast Tower KT8
N/A | 0xFA770171 | Segmentation fault. | Improper authorization or corrupted file string | Automatron (add-on): RobCo Sales & Service Center public display terminal
N/A | 0x0D890102 | Boot sector invalid/corrupt. | Corrupt or missing BIOS/MF Boot Agent | Automatron (add-on): RobCo Sales & Service Center public display terminal
N/A | 0xFFF11011 | Memory fault. | Unreadable memory access | Automatron (add-on): RobCo Sales & Service Center public display terminal
N/A | 0xC0001011 | Critical failure. | Unknown critical system failure | Automatron (add-on): RobCo Sales & Service Center public display terminal
N/A | 0x00001001 | Critical failure. | Unknown critical system failure | Automatron (add-on): RobCo Sales & Service Center public display terminal
N/A | 0xA001C007 | Invalid request. | Corrupted or non-privileged write attempt | Fallout 76: Watoga Emergency Services complaint database terminal

 

But as the code 502 = 0x1F6 obviously belongs to the "External codes" section

Spoiler

External error

Image | Code | Message attached | Purpose | Place where a terminal shows it
N/A | 0x081 | Data loss detected. | Lost data from a holotape | Fallout 4: Chestnut Hillock Reservoir, Edwin's terminal
N/A | 0x104 | Manual controls are disabled while Nuka-Galaxy is in AUTOMATIC Ride Mode. | Manual controls were attempted while ride was in Automatic Mode | Nuka-World (add-on): Nuka-Galaxy, control terminal
N/A | 0x107 | Catastrophic physical damage detected. | Extensive damage to the external device | Nuclear Winter (update): Vault 94, G.E.C.K. monitoring station terminal
N/A | 0x109 | A core meltdown is in progress. | Complete or partial collapse of the nuclear core, such as with a Garden of Creation Kit | Nuclear Winter (update): Vault 94, G.E.C.K. monitoring station terminal
N/A | 0x117 | Critical damage to mobility system. Unit is unable to move. | Severe damage to legs or propulsion jet | Nuka-World (add-on): RobCo Battlezone, Robotics diagnostics terminal
N/A | 0x281 | Critical damage to targeting system. While in combat, unit will attack random targets. | Severe damage to Combat inhibitor | Nuka-World (add-on): RobCo Battlezone, Robotics diagnostics terminal
N/A | 0x391 | Multiple malfunctions detected. The ride has been halted until safe conditions can be restored. | Malfunctions from several sources have triggered a safety lockout | Nuka-World (add-on): Nuka-Galaxy, control terminal
N/A | 0x401 | Unknown | Unknown | Nuclear Winter (update): Vault 94, G.E.C.K. monitoring station terminal
N/A | 0x402 | Unknown | Unknown | Nuclear Winter (update): Vault 94, G.E.C.K. monitoring station terminal
N/A | 0x519 | Unknown | Unknown | Nuclear Winter (update): Vault 94, G.E.C.K. monitoring station terminal
N/A | 0x509 | Unable to connect to the ABIS Biometrics ID Fabricator. | Disconnected or malfunctioning automated biometric identification system | Fallout 76: Missile Silo terminal entries, biometrics system terminal

, the problem is probably at the peripheral device side.

 

Probably, some gateway failure.

Maybe molerats, maybe metal scrap hunters.

 

Edited by kerbiloid

On 12/25/2024 at 2:35 AM, ColdJ said:

@Lisias The triple posting is what I saw after I tried to post this and got a Bad Gateway 502 for 5 minutes.

Yep. This is the proof that whatever is happening is not in Forum itself but in the place where Forum "lives", i.e., the infra of the server farm.

This is what happens:

  1. You post a new message
    1. Forum receives it, processes it, and sends the response with the new page
    2. But the response is lost, and someone in the middle sends you a 502
  2. Then you immediately post the message again
    1. Forum receives it, processes it (in this case, merging it with the last message you sent), and sends the response with the updated page.
    2. But the response is lost, and someone in the middle sends you a 502 again
  3. Rinse, repeat.

If Forum itself were screwed (as really happened months ago), you would not be able to post anything - let alone have previous posts merged. So Forum is healthy, and it has probably been healthy for months. The troublemaker is someone above Forum and below CloudFlare, which strongly suggests the server farm is responsible for the problem somehow.
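The sequence above can be reproduced with a toy simulation. Everything here is hypothetical (the `Forum` class, the `proxy` function, the merge rule for consecutive posts by the same author); it only illustrates how a backend that does its job plus a lost response plus a manual retry ends up as a multiplied post.

```python
class Forum:
    """Toy backend: stores posts and merges consecutive posts by one author."""

    def __init__(self):
        self.posts = []  # list of (author, text) tuples

    def handle_post(self, author, text):
        if self.posts and self.posts[-1][0] == author:
            previous = self.posts[-1][1]
            self.posts[-1] = (author, previous + "\n\n" + text)
        else:
            self.posts.append((author, text))
        return 200


def proxy(forum, author, text, response_lost):
    """Something in the middle: the backend always does the work,
    but the client only sees a 200 if the response survives the trip back."""
    status = forum.handle_post(author, text)
    return 502 if response_lost else status


def post_with_retries(forum, author, text, lost_pattern):
    """Resubmit the same form until a response gets through."""
    for response_lost in lost_pattern:
        if proxy(forum, author, text, response_lost) == 200:
            return 200
    return 502


forum = Forum()
final_status = post_with_retries(forum, "user", "hello", [True, True, False])
print(final_status)
print(forum.posts)
```

With two lost responses before a successful one, the single stored post contains the text three times, which matches the triple posting quoted above.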

From my experience with AWS, I know that it's usual to impose quotas on outbound traffic while making inbound traffic free. IF AND ONLY IF the server farm hosting Forum does the same, my best guess is, indeed, a too drastic cost reduction as I had proposed initially - not on the number of servers hosting Forum, but on the outbound traffic quota itself (which I thought should be cheaper than the servers, but - obviously - I was wrong on this detail).

 

2 hours ago, kerbiloid said:

Every time somebody tries to open the forum

Well... It's not too different from what I was doing initially, but then I built some logging tools and left this suffering behind! :D


I live in Mountain Time USA

Between 7pm and 10pm I have a very, very hard time connecting to the forum. It will initially connect, but any link will produce the error. Refreshing the page a few times has varied success, but I get nothing after the initial load when trying to go somewhere deeper into the forum. Other than that I have had few issues. The connectivity in general "seems" better, but I do have an almost complete forum blackout at certain times of night.

This would support the max outbound quota theory (in my opinion), depending on when the fiscal refresh occurs on that metric, since my problem shows up near the end of MY day.
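To line this window up with monitoring charts kept in UTC ("Z" times), the conversion can be sketched as below. It assumes Mountain Standard Time as a fixed UTC-7 offset (true in winter, not during daylight saving time) and picks Dec 21, 2024 purely as an example date.

```python
from datetime import datetime, timedelta, timezone

# Fixed winter offset for US Mountain Time; an assumption for this example.
MST = timezone(timedelta(hours=-7), "MST")


def local_to_utc(year, month, day, hour, tz=MST):
    """Convert a local wall-clock hour to an aware UTC datetime."""
    return datetime(year, month, day, hour, tzinfo=tz).astimezone(timezone.utc)


start = local_to_utc(2024, 12, 21, 19)  # 7pm local
end = local_to_utc(2024, 12, 21, 22)    # 10pm local

print(start.strftime("%Y-%m-%d %H:%MZ"), "to", end.strftime("%Y-%m-%d %H:%MZ"))
```

Under those assumptions the evening window falls in the early hours of the next UTC day, which is where to look on a UTC chart.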

Edited by Fizzlebop Smith
