
Bad Gateway, Possible End, Theories, Solutions.


ColdJ


Trying to view the forums at 06:10 am GMT +10.

Each attempt to open a page takes multiple tries, with Bad Gateway 502 popping up over and over.


1 hour ago, ColdJ said:

Trying to view the forums at 06:10 am GMT +10.

Each attempt to open a page takes multiple tries, with Bad Gateway 502 popping up over and over.

Now that I have stabilized the scraping tool, it also ended up being a nice watchdog!

I've been running it continuously since 2024-07-27 18:07:36 GMT-3, and this is what I got:

  • From 2024-07-27 18:07:36 to 2024-07-28 13:45:36
    • nominal
  • From 2024-07-28 13:46:24 to 2024-07-28 14:14:36
    • severe turbulence
  • From 2024-07-28 14:18:36 to 2024-07-28 15:04:36
    • flawless
  • From 2024-07-28 15:05:36 to 2024-07-28 16:41:35
    • severe turbulence
  • From 2024-07-28 16:41:36 to 2024-07-28 17:54:36
    • some turbulence
  • From 2024-07-28 18:00:36 to 2024-07-28 18:32:36 (right now)
    • nominal

I'm extracting about 60 pages/minute, so the data above is pretty accurate without false negatives.
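For the curious, the watchdog part boils down to something like this (just a sketch, not the actual scraping tool; the target URL and the 1-second interval are illustrative):

#!/usr/bin/env python3
# Minimal availability watchdog (sketch): probes a page periodically and
# logs every answer that is not a 2xx/3xx. Not the actual scraping tool.
import datetime
import time
import urllib.error
import urllib.request

URL = "https://forum.kerbalspaceprogram.com/"   # illustrative target
INTERVAL = 1.0                                  # seconds between probes

def probe(url):
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code        # 4xx/5xx answers land here
    except urllib.error.URLError:
        return 0             # no HTTP answer at all (DNS failure, timeout, etc.)

if __name__ == "__main__":
    while True:
        status = probe(URL)
        if not 200 <= status < 400:
            stamp = datetime.datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} status={status}")
        time.sleep(INTERVAL)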

You know, there's interesting data that can be extracted from these logs. Next iteration (202408) I will preserve the logs...


I keep getting constant 502 "Bad Gateway" errors on both my phone and computer, and they're getting more and more frequent. Is this an issue with the forums or just me?


On 7/31/2024 at 2:55 PM, Kerbalsaurus said:

I keep getting constant 502 "Bad Gateway" errors on both my phone and computer, and they're getting more and more frequent. Is this an issue with the forums or just me?

I'm not scraping this week (I'm auditing and cleaning up what I have), so I can't be sure about the current time frames, but I have the feeling that my last post about this is still valid.

=== == = POST EDIT = == ===

This is a small, crude but still useful report about the occurrences of HTTP 5xx in my scraping results, by WARC file.

forum.kerbalspaceprogram.com-00000.warc
     24 HTTP/1.1 500 Internal Server Error
      5 HTTP/1.1 522
forum.kerbalspaceprogram.com-00001.warc
forum.kerbalspaceprogram.com-20240713061810129595.a.warc
forum.kerbalspaceprogram.com-20240713061810129595.b.warc
forum.kerbalspaceprogram.com-20240713124446675422.warc
    476 HTTP/1.1 502 Bad Gateway
    253 HTTP/1.1 503 Service Unavailable: Back-end server is at capacity
     18 HTTP/1.1 504 GATEWAY_TIMEOUT
forum.kerbalspaceprogram.com-20240715163142475518.warc
    363 HTTP/1.1 502 Bad Gateway
    202 HTTP/1.1 503 Service Unavailable: Back-end server is at capacity
     54 HTTP/1.1 504 GATEWAY_TIMEOUT
forum.kerbalspaceprogram.com-20240719043104692502.warc
    100 HTTP/1.1 502 Bad Gateway
     63 HTTP/1.1 503 Service Unavailable: Back-end server is at capacity
     19 HTTP/1.1 504 GATEWAY_TIMEOUT
forum.kerbalspaceprogram.com-20240726154803.warc
forum.kerbalspaceprogram.com-20240726163221.warc
      1 HTTP/1.1 504 GATEWAY_TIMEOUT
forum.kerbalspaceprogram.com-20240726183513.warc
forum.kerbalspaceprogram.com-20240726204605345405.warc
forum.kerbalspaceprogram.com-20240730071003571022.warc

The huge numbers are timestamps, in the format YYYYMMDDHHmmssllllll, where:

  • YYYY == year 4 digits
  • MM == month 2 digits
  • DD == day 2 digits
  • HH == hour 2 digits
  • mm == minutes 2 digits
  • ss == seconds 2 digits
  • llllll == fractional seconds, 6 digits (i.e., microseconds).
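For reference, these stamps parse directly with Python's datetime (the trailing 6 digits fit the %f microseconds field):

from datetime import datetime

# Parse a scraping-session timestamp like the ones in the WARC file names.
stamp = "20240713124446675422"
started = datetime.strptime(stamp, "%Y%m%d%H%M%S%f")
print(started)   # 2024-07-13 12:44:46.675422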

The timestamps tell when each scraping session started, but say nothing about its duration, so they are not that useful (proper reports are in the works). Even from that incomplete information we can conclude that:

  1. These pesky problems were already happening in the 2023.05 scrape (essentially one year ago) that I found on Archive.org
    1. So, really, all of this is old news
  2. They do not happen all the time; there are time frames in which they usually happen
    1. suggesting they are the collateral effect of some automated process
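By the way, a crude per-file tally like the one above can be reproduced with a few lines of Python (a sketch that assumes uncompressed .warc files and simply counts HTTP 5xx status lines, much like grep | sort | uniq -c would; the real report scripts live in the repository linked below):

#!/usr/bin/env python3
# Crude tally of HTTP 5xx status lines per WARC file (uncompressed .warc
# assumed). Sketch only: it just scans the raw records for status lines.
import re
import sys
from collections import Counter

STATUS_LINE = re.compile(rb"^HTTP/1\.[01] (5\d\d[^\r\n]*)")

for path in sys.argv[1:]:
    tally = Counter()
    with open(path, "rb") as warc:
        for line in warc:
            match = STATUS_LINE.match(line)
            if match:
                tally[match.group(1).decode("ascii", "replace").strip()] += 1
    print(path)
    for status, count in sorted(tally.items()):
        print(f"{count:8d} HTTP/1.1 {status}")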

=== == = POST² EDIT = == ===

People, this is the breakdown of the HTTP 5xx errors I got from July 13 to 30th (GMT-3):

[Image: report_http5_by_hour.png — HTTP 5xx occurrences by hour]

This does not represent the totality of the HTTP 5xx errors on the site, only the ones I got while scraping the Forum at a rate of 20 to 60 pages/minute. Most of the occurrences happened between 16:00 and 21:00, and most of those between 16:00 and 19:00, GMT-3.

I don't have an explanation for the anomaly on July 13th, but surely my mishaps while trying to find the best delays for the scraper played a role in it.

As usual, all the source code I used is available at:

https://github.com/net-lisias-ksp/KSP-Forum-Preservation-Project/tree/master/torrent/reports

But not the WARC files: I have about 95GB of dirty data and, frankly, I'm not in the mood to upload all that crap. :)

WARC files on disk:
1.5G    forum.kerbalspaceprogram.com-20240713061810129595.bkp2.warc
221M    forum.kerbalspaceprogram.com-20240713061810129595.bkp.warc
1.5G    forum.kerbalspaceprogram.com-20240713061810129595.warc
1.3G    forum.kerbalspaceprogram.com-20240713081235884757.bkp.warc
9.4G    forum.kerbalspaceprogram.com-20240713124446675422.warc
9.4G    forum.kerbalspaceprogram.com-20240715163142475518.warc
3.8G    forum.kerbalspaceprogram.com-20240719043104692502.warc
1.8M    forum.kerbalspaceprogram.com-20240726154803.warc
5.7M    forum.kerbalspaceprogram.com-20240726163221.warc
832K    forum.kerbalspaceprogram.com-20240726183513.warc
2.6M    forum.kerbalspaceprogram.com-20240726200253887219.warc
8.8M    forum.kerbalspaceprogram.com-20240726204605345405.warc
9.4M    forum.kerbalspaceprogram.com-20240727134152841327.warc
8.9G    forum.kerbalspaceprogram.com-20240729205641826681.warc
142M    forum.kerbalspaceprogram.com-20240730071003571022.warc
9.3G    forum.kerbalspaceprogram.com-20240730081005118392.warc
9.4G    forum.kerbalspaceprogram.com-20240730182741207135.warc
9.4G    forum.kerbalspaceprogram.com-20240731072212329973.warc
23G     forum.kerbalspaceprogram.com-202407.warc
7.2G    forum.kerbalspaceprogram.com-20240801011247007658.warc
40K     forum.kerbalspaceprogram.com-images-20240726154747415944.warc
7.3M    forum.kerbalspaceprogram.com-images-20240726163134595018.warc
140K    forum.kerbalspaceprogram.com-images-20240726183456926078.warc
3.2M    forum.kerbalspaceprogram.com-images-20240726200237332251.warc
7.9M    forum.kerbalspaceprogram.com-images-20240727134041011275.warc
19M     forum.kerbalspaceprogram.com-media-20240713061810129595.warc
607M    forum.kerbalspaceprogram.com-media-20240713124446675422.warc
256M    forum.kerbalspaceprogram.com-media-20240715163142475518.warc
95G     total

@Deddly, @Vanamonde, I hope this information can be useful somehow. The CSV files (as well as an OpenOffice spreadsheet) are available at the link above.


14 hours ago, kerbiloid said:

Is it known where the forum servers are physically located, geographically?

Most likely in a room somewhere. We highly doubt they would have survived this long out in a field unattended.


3 hours ago, Gargamel said:

somewhere

I mean, to see what they are doing at those moments.

Maybe it somehow depends on the lunar tides. High water - server is underwater, low water - server is available.

***

https://en.wikipedia.org/wiki/On_the_Beach_(2000_film)

Quote

The source of the automated digital broadcast is traced to a television station whose broadcast, Towers and his executive officer discover, comes from a solar-powered laptop trying to broadcast a documentary via satellite.

 

Maybe, like in the novel, the wind is moving the window back and forth, switching the server on and off.


On 8/12/2024 at 9:38 AM, ColdJ said:

2 really bad blackouts in the last 48 hours.

And one this morning (GMT-3), starting at 08:43 and still ongoing. I detected a pretty small improvement at 9:00 sharp, but it lasted less than 5 minutes. I noticed another small improvement at 09:59, but it lasted only 2 minutes.

So I stopped all my activities at 11:20 GMT-3, fearing I could somehow be responsible for the problem, so I don't have any more information from that point. Unfortunately (if I were involved in the mess, at least there would be something I could do about it), nothing changed.

Given the timestamps at which I noticed those small improvements, I'm guessing one of the triggers for the problem is some quota being exhausted, and my timings above strongly suggest this quota is renewed hourly.

I will publish a small report later, but it will not provide us with better information than what I have already summarized above.


@Lisias 4 bad blackouts in 3 days now. It took me 2 hours to get on the forums just now.

I also notice that I am getting 2 different servers giving me 502 Bad Gateway, the second being Cloudflare.


1 hour ago, ColdJ said:

I also notice that I am getting 2 different servers giving me 502 Bad Gateway, the second being Cloudflare.

This strongly suggests the problem is something both of them use, such as a database. My current working theory is that the contracted database's hourly quota is burning up pretty quickly.

I think someone needs to reconfigure Cloudflare to limit the allowed number of requests per minute. I have word that it's currently 1500 hits per minute, and this appears to be too much. I would try setting it to 500, and if that still fails, 250 and later 125. I would not lower the value below that, as a lot of people share an IP due to Carrier-Grade NAT (or VPNs). From that point on, I think some other measure would be needed.

@Deddly, @Vanamonde, @Gargamel, is there any report you can access from the Forum Console where the accesses are logged? Of course, you can't share it with me due to privacy concerns, but if you can provide me with a sample entry (I can hit the Forum once with my company server's IP using curl, and then you send me that exact log entry), I can cook up a small Python or Perl script, and then you can run the logs through it and detect the offenders.
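Just to illustrate what I mean, the script would be something like this (a sketch only: I don't know the actual log format, so it assumes a common/combined-style access log where the client IP is the first whitespace-separated field):

#!/usr/bin/env python3
# Sketch: count requests per client IP in an access log and print the
# heaviest hitters. Field position is an assumption (common/combined
# log format, IP first); adjust IP_FIELD for whatever the Forum uses.
import sys
from collections import Counter

IP_FIELD = 0    # assumption: the client IP is the first field
TOP_N = 20

hits = Counter()
with open(sys.argv[1], "r", errors="replace") as log:
    for line in log:
        fields = line.split()
        if len(fields) > IP_FIELD:
            hits[fields[IP_FIELD]] += 1

for ip, count in hits.most_common(TOP_N):
    print(f"{count:8d}  {ip}")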

1 hour ago, ColdJ said:

@Lisias 4 bad blackouts in 3 days now. It took me 2 hours to get on the forums just now.

An idea just occurred to me: I noticed that wiki.kerbalspaceprogram.com resolves to the same IP as the Forum for me.

I will monitor the wiki too, at the same time as I (somehow) monitor the Forum. This will help in zeroing in on the root cause.
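Checking that takes only a few lines (a sketch; note that behind a CDN the answers may rotate, so a single lookup is only a hint):

import socket

# Compare the addresses both names resolve to; any overlap hints at shared
# front-end infrastructure (behind a CDN the answers may rotate).
def addresses(host):
    return {info[4][0] for info in socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)}

forum = addresses("forum.kerbalspaceprogram.com")
wiki = addresses("wiki.kerbalspaceprogram.com")
print("forum :", sorted(forum))
print("wiki  :", sorted(wiki))
print("shared:", sorted(forum & wiki))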

=== == = ADDENDUM = == ==

As of 13:13 GMT-3, I'm not experiencing any HTTP 5xx anymore.

=== == = ADDENDUM² = == ==

It's 13:37 GMT-3, and everything is screwed again. :/


FYI

Report of the HTTP 5xx errors from last Sunday to today (until 16:20 Zulu). This doesn't report all the occurrences, only the ones I was monitoring. I salvaged all the logs this time; I will further expand the reports with my own hits.

I noticed some changes on Cloudflare yesterday and the day before. They caused some changes in the numbers, but didn't change the time at which the main batch of problems starts: more or less at 11:00 GMT-3 (14:00 Zulu). I think some other changes happened again early today or late yesterday, as the last 12 hours of the graph below (from 0:00 to 13:20 GMT-2) show what appears to be an improvement.

 

[Image: report_http5_by_hour.2.png — HTTP 5xx occurrences by hour, updated]

Source: https://github.com/net-lisias-ksp/KSP-Forum-Preservation-Project/tree/master/torrent/reports

Link to comment
Share on other sites

5 hours ago, Kerbalsaurus said:

They're starting to become less common for me, only happening between 12pm and 2pm EST.

Lately, I've been getting them around that time (in CDT), but also around 5-6pm CDT. 


FYI

Some graphs with the (perceived) Forum status in the last 7 days (GMT-3 timezone):

[Image: 20240825.Events.png — HTTP 429/5xx events over the last 7 days]

The HTTP 429s are harmless; they're the Forum telling me that I'm hitting it too much, triggering embargoes from the scraping tool. The nasty stuff is the 5xx ones.

Since I first implemented the 429 delays last weekend, the 5xx incidence has dropped drastically, except between Aug 21st, 13:00 (GMT-3) and Aug 24th, 8:00. I don't have a clue about the reason.

Columns without any bars are timestamps without any occurrences other than HTTP 2xx or 3xx. :)

Since I implemented embargoes for these events (60 seconds minimum for a 429, 900 for a 5xx, with logarithmic backoff), these charts are not accurate: after receiving 2 or 3 5xx responses consecutively, I may embargo scraping for up to 2 hours, and I don't know what happened in the meantime (the Forum may be working, or may be issuing new Bad Gateway events - I just don't know).
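For the curious, the embargo policy boils down to something like this (a simplified sketch, not the actual tool; the doubling curve is only an approximation of the backoff I described, and fetch() is a hypothetical function returning the HTTP status):

import time

# Sketch of the embargo policy: a base pause per status class, growing
# with consecutive errors and capped at 2 hours. The doubling curve is
# illustrative; the real tool's formula may differ.
BASE_429 = 60.0          # seconds for HTTP 429 Too Many Requests
BASE_5XX = 900.0         # seconds for any HTTP 5xx
MAX_DELAY = 2 * 60 * 60  # hard cap: 2 hours

def embargo_for(status, consecutive_errors):
    """Seconds to pause after this status, given how many errors in a row."""
    if status == 429:
        base = BASE_429
    elif 500 <= status <= 599:
        base = BASE_5XX
    else:
        return 0.0
    return min(base * (2 ** max(consecutive_errors - 1, 0)), MAX_DELAY)

# usage sketch:
# errors = 0
# for url in pages:
#     status = fetch(url)   # hypothetical: returns the HTTP status code
#     errors = errors + 1 if status == 429 or status >= 500 else 0
#     time.sleep(embargo_for(status, errors))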

https://github.com/net-lisias-ksp/KSP-Forum-Preservation-Project/tree/master/torrent/reports


FYI

On 2024-08-27, from approximately 7:00 to 9:00 GMT-0, I'm having trouble accessing the Forum due to those pesky HTTP 5xx errors.

Personal scraping is not the problem - I voluntarily back off once I receive a 429 or any 5xx, and yet the Forum is borking. If the same measures being applied to me are also being applied to everybody else, scraping is not the issue.

[Image: 20240827.Events.png — HTTP 429/5xx events]

The 429s are harmless because I back off on every one I receive, but they hint that the Forum is getting hammered somehow - the others, the 5xx ones, are the problem.

[Image: 20240827.Connections.png — connection counts]

The charts above make it clear that personal scraping is not the cause of the outages. Whatever is being done to limit us is not working - which strongly suggests that the problem is being misdiagnosed and mishandled.

=== == = POST EDIT = == ===

Apparently things came back to normal at 9:30 Z


FYI

The borkage is still ongoing, and getting worse:

[Image: 20240829.Events.png — HTTP 429/5xx events]

at the same time, personal access to the Forum is being handicapped (perhaps in an attempt to solve the problem? If so, it's not working, sadly):

[Image: 20240829.Connections.png — connection counts]

Whatever is happening:

  1. It's not directly connected to the HTTP 429 events, as the 5xx errors continued to happen even after I started monitoring the former (on Aug 24th) and self-restricting when detecting them.
  2. It's getting worse over time, despite our best efforts to limit personal scraping.
    1. Corollary: personal scraping was never the problem.
      1. Correlation does not imply causation!

From this point on, I don't plan to further pester you guys with such data. The problem is being neither solved nor mitigated, and this may have ended up hurting my own efforts - without noticeable benefit.

--- -- - POST EDIT - -- ---

Explanations for the technically handicapped :P (a small client-side sketch follows the list):

  • HTTP 429 Too Many Requests is a message the server sends to us, the requesters, telling us that we are hitting it too much and that it would appreciate it if we backed off a bit.
    • It's an HTTP 4xx series message, meaning something was done wrong by the client and should be fixed on the client side.
  • HTTP 5xx error messages are nasty errors that happened on the server side. Even when indirectly caused by user action, they are considered server errors because the server is the source of the problem. Examples of such errors we have been facing lately:
    • HTTP 500 Internal Server Error:
      • The request passed through the stack, but something inside the worker broke (like an unhandled exception) in an unexpected way and the process crashed.
    • HTTP 502 Bad Gateway:
      • Something broke when the frontend (nginx, Apache or IIS - the process that acts like a telephone operator, distributing the requests to whoever will handle them) received the request and tried to pass it to the process responsible for actually handling it (we call these backend processes), but there was nobody listening (usually the worker process - the daemon - had crashed, or was in the middle of a restart).
    • HTTP 503 Service Unavailable:
      • Someone (or something) deactivated the backend, and the frontend tried to reach the worker nevertheless and (obviously) couldn't.
    • HTTP 504 Gateway Timeout:
      • The frontend managed to send the request to the backend, but the backend didn't answer back within a defined amount of time (the timeout). This usually means that something inside the backend is overloaded and taking too long to finish its tasks, and by the time it finally does, there's no one waiting for the answer anymore on the other side.
        • It's the worst kind, IMHO, because the timed-out request already incurred costs on the backend, and it also contributes to the problem happening again when the cause is quota exhaustion (it consumes quota without providing any result, prompting the caller to ask again, in a self-feeding vicious cycle).
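And here is roughly how a well-behaved client can treat those classes of answers (a sketch using Python's standard library; the URL is illustrative):

import urllib.error
import urllib.request

# Sketch: classify an HTTP answer along the lines described above.
# 2xx/3xx -> fine; 429 -> slow down; other 4xx -> fix the request;
# 5xx -> server-side trouble, back off and retry later.
def classify(url):
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return f"{resp.status}: OK"
    except urllib.error.HTTPError as e:
        if e.code == 429:
            retry_after = e.headers.get("Retry-After", "unspecified")
            return f"429: slow down (Retry-After: {retry_after})"
        if 400 <= e.code < 500:
            return f"{e.code}: client-side problem, fix the request"
        return f"{e.code}: server-side problem, back off and retry later"
    except urllib.error.URLError as e:
        return f"no HTTP answer at all ({e.reason})"

print(classify("https://forum.kerbalspaceprogram.com/"))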

=== == = POST EDIT2 = == ==

Updated the repo with data from this afternoon. We had more 5xx events.


(2 weeks later...)

Not willing to beat a dead horse, but I found this on my feed today:

'This was essentially a two-week long DDoS attack': Game UI Database slowdown caused by relentless OpenAI scraping

https://www.gamedeveloper.com/business/-this-was-essentially-a-two-week-long-ddos-attack-game-ui-database-slowdown-caused-by-openai-scraping

Personal scrapers weren't the problem, as the Bad Gateways kept happening and happening even while the scrapers were imposing temporary moratoria on themselves (or plainly stopped scraping at all for some weeks).

The real cause was left unchecked.


On 7/9/2024 at 9:50 PM, ColdJ said:

If, as I understand it, the company that hosts the KSP servers is used by over 40% of websites worldwide, and if I am right from stuff I have read that ChatGPT has access to the server, it may be that ChatGPT is inadvertently causing the problems by accessing the servers, looking for data, at a rate far greater than humans alone would. And these high-speed and numerous calls to the servers are overwhelming them. When they were first established, they would not have been thinking of the servers being accessed as fast as these AI programs do.

 

I am surprised that @Lisias didn't tag @Vanamonde, @Deddly and @Gargamel to make sure they read the info at that link.

Quote from the article:

He explained the disruption eventually caused the website to stop working altogether, dishing out "502 Bad Gateway" errors to users. At that stage, Coates sought the help of Jay Peet, who hosted the database on their private server for the last five years. Peet looked at the site logs and realized the website's resources were being swallowed by a single IP address belonging to OpenAI.

"The homepage was being reloaded 200 times a second, as the [OpenAI] bot was apparently struggling to find its way around the site and getting stuck in a continuous loop," added Coates. "This was essentially a two-week long DDoS attack in the form of a data heist."

