Gargamel, posted July 28:

13 hours ago, Lisias said: "he apparently made a mistake on trying to explain why."

No... No I didn't. I said exactly what I meant to say.

ColdJ (thread author), posted July 28:

Trying to view the forums at 06:10 am GMT +10. Each attempt to open a page takes multiple tries, with Bad Gateway 502 popping up over and over.
Lisias, posted July 28:

1 hour ago, ColdJ said: "Trying to view the forums at 06:10 am GMT +10. Each attempt to open a page takes multiple tries, with Bad Gateway 502 popping up over and over."

Now that I have stabilized the scraping tool, it has ended up being a nice watchdog too! I have been running it continuously since 2024-07-27 18:07:36 GMT-3, and this is what I got:

From 2024-07-27 18:07:36 to 2024-07-28 13:45:36: nominal
From 2024-07-28 13:46:24 to 2024-07-28 14:14:36: severe turbulence
From 2024-07-28 14:18:36 to 2024-07-28 15:04:36: flawless
From 2024-07-28 15:05:36 to 2024-07-28 16:41:35: severe turbulence
From 2024-07-28 16:41:36 to 2024-07-28 17:54:36: some turbulence
From 2024-07-28 18:00:36 to 2024-07-28 18:32:36 (right now): nominal

I'm extracting about 60 pages/minute, so the data above is pretty accurate, without false negatives.

You know, there's interesting data that can be extracted from these logs. On the next iteration (202408) I will preserve the logs...
kspbutitscursed, posted July 30:

interesting thread

Kerbalsaurus, posted July 31:

I keep getting constant 502 "Bad Gateway" errors on both my phone and computer, and they're getting more and more frequent. Is this an issue with the forums or just me?

HebaruSan, posted July 31:

Everyone is affected, unfortunately.
Lisias, posted August 5 (edited):

On 7/31/2024 at 2:55 PM, Kerbalsaurus said: "I keep getting constant 502 "Bad Gateway" errors on both my phone and computer, and they're getting more and more frequent. Is this an issue with the forums or just me?"

I'm not scraping this week (I'm auditing and cleaning up what I have), so I can't be sure about the current time frames, but I have the feeling that my last post about them is still valid.

=== == = POST EDIT = == ===

This is a small, crude but still useful report about the occurrences of HTTP 5xx in my scraping reports, by WARC file.

forum.kerbalspaceprogram.com-00000.warc
     24 HTTP/1.1 500 Internal Server Error
      5 HTTP/1.1 522
forum.kerbalspaceprogram.com-00001.warc
forum.kerbalspaceprogram.com-20240713061810129595.a.warc
forum.kerbalspaceprogram.com-20240713061810129595.b.warc
forum.kerbalspaceprogram.com-20240713124446675422.warc
    476 HTTP/1.1 502 Bad Gateway
    253 HTTP/1.1 503 Service Unavailable: Back-end server is at capacity
     18 HTTP/1.1 504 GATEWAY_TIMEOUT
forum.kerbalspaceprogram.com-20240715163142475518.warc
    363 HTTP/1.1 502 Bad Gateway
    202 HTTP/1.1 503 Service Unavailable: Back-end server is at capacity
     54 HTTP/1.1 504 GATEWAY_TIMEOUT
forum.kerbalspaceprogram.com-20240719043104692502.warc
    100 HTTP/1.1 502 Bad Gateway
     63 HTTP/1.1 503 Service Unavailable: Back-end server is at capacity
     19 HTTP/1.1 504 GATEWAY_TIMEOUT
forum.kerbalspaceprogram.com-20240726154803.warc
forum.kerbalspaceprogram.com-20240726163221.warc
      1 HTTP/1.1 504 GATEWAY_TIMEOUT
forum.kerbalspaceprogram.com-20240726183513.warc
forum.kerbalspaceprogram.com-20240726204605345405.warc
forum.kerbalspaceprogram.com-20240730071003571022.warc

The huge numbers are timestamps, in the format YYYYMMDDHHmmssllllll, where:

YYYY == year, 4 digits
MM == month, 2 digits
DD == day, 2 digits
HH == hour, 2 digits
mm == minutes, 2 digits
ss == seconds, 2 digits
llllll == microseconds, 6 digits
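As an illustration of the timestamp layout described above, here is a minimal Python sketch that pulls the session start out of a WARC file name. The function name `session_start` and the regular expression are hypothetical, not taken from the project's repository. It assumes the stamp is the digit run right after a hyphen and before the next dot, and that the six-digit fraction may be missing (some of the listed files carry only 14 digits). The result is a naive datetime, since the file names do not say which time zone they use.

```python
import re
from datetime import datetime

# 14 digits (YYYYMMDDHHmmss) plus an optional 6-digit fraction, followed by a dot.
STAMP = re.compile(r"-(\d{14})(\d{6})?(?=\.)")

def session_start(filename):
    """Return the session start encoded in a WARC file name, or None if absent."""
    match = STAMP.search(filename)
    if match is None:
        return None  # e.g. "...-00000.warc" carries no timestamp
    base, fraction = match.groups()
    # Pad a missing fraction with zeros so one format string covers both cases.
    return datetime.strptime(base + (fraction or "000000"), "%Y%m%d%H%M%S%f")

print(session_start("forum.kerbalspaceprogram.com-20240713124446675422.warc"))
```

Sorting the files by this value gives the order in which the sessions started, but still says nothing about how long each one ran.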
The timestamps tell when each scraping session started but say nothing about its duration, so they are not that useful (proper reports are in the works). Even from this incomplete information, though, we can conclude that:

These pesky problems were already happening last year, in the 2023.05 scrape (essentially, one year ago) that I found on Archive Org.
So, really, all of this is old news.
They do not happen all the time; there are time frames in which they usually happen, suggesting they are a side effect of some automated process.

=== == = POST² EDIT = == ===

People, this is the breakdown of the HTTP 5xx errors I got from July 13 to 30th (GMT-3):

This does not represent all the HTTP 5xx errors of the site, only the ones that I got while scraping the Forum at a rate of 20 to 60 pages/minute.

Most of the occurrences happened between 16:00 and 21:00 GMT-3, the bulk of them from 16:00 to 19:00.

I don't have an explanation for the anomaly on July 13th, but surely my mishaps while trying to find the best delays for the scraper played a role in it.

As usual, all the source code I used is available on:

https://github.com/net-lisias-ksp/KSP-Forum-Preservation-Project/tree/master/torrent/reports

But not the WARC files: I have about 95GB of dirty data and, frankly, I'm not in the mood to upload all that crap.
Spoiler:

1.5G forum.kerbalspaceprogram.com-20240713061810129595.bkp2.warc
221M forum.kerbalspaceprogram.com-20240713061810129595.bkp.warc
1.5G forum.kerbalspaceprogram.com-20240713061810129595.warc
1.3G forum.kerbalspaceprogram.com-20240713081235884757.bkp.warc
9.4G forum.kerbalspaceprogram.com-20240713124446675422.warc
9.4G forum.kerbalspaceprogram.com-20240715163142475518.warc
3.8G forum.kerbalspaceprogram.com-20240719043104692502.warc
1.8M forum.kerbalspaceprogram.com-20240726154803.warc
5.7M forum.kerbalspaceprogram.com-20240726163221.warc
832K forum.kerbalspaceprogram.com-20240726183513.warc
2.6M forum.kerbalspaceprogram.com-20240726200253887219.warc
8.8M forum.kerbalspaceprogram.com-20240726204605345405.warc
9.4M forum.kerbalspaceprogram.com-20240727134152841327.warc
8.9G forum.kerbalspaceprogram.com-20240729205641826681.warc
142M forum.kerbalspaceprogram.com-20240730071003571022.warc
9.3G forum.kerbalspaceprogram.com-20240730081005118392.warc
9.4G forum.kerbalspaceprogram.com-20240730182741207135.warc
9.4G forum.kerbalspaceprogram.com-20240731072212329973.warc
23G forum.kerbalspaceprogram.com-202407.warc
7.2G forum.kerbalspaceprogram.com-20240801011247007658.warc
40K forum.kerbalspaceprogram.com-images-20240726154747415944.warc
7.3M forum.kerbalspaceprogram.com-images-20240726163134595018.warc
140K forum.kerbalspaceprogram.com-images-20240726183456926078.warc
3.2M forum.kerbalspaceprogram.com-images-20240726200237332251.warc
7.9M forum.kerbalspaceprogram.com-images-20240727134041011275.warc
19M forum.kerbalspaceprogram.com-media-20240713061810129595.warc
607M forum.kerbalspaceprogram.com-media-20240713124446675422.warc
256M forum.kerbalspaceprogram.com-media-20240715163142475518.warc
95G total

@Deddly, @Vanamonde, I hope this information can be useful somehow. The CSV files (as well as an Open Office spreadsheet) are available at the link above.

Edited August 8 by Lisias (POST² EDIT)
Deddly, posted August 6:

That's some great work right there!

ColdJ (thread author), posted August 12:

2 really bad blackouts in the last 48 hours.
kerbiloid, posted August 12:

Is it known where the forum servers are physically located?

Gargamel, posted August 13:

14 hours ago, kerbiloid said: "Is it known where the forum servers are physically located?"

Most likely in a room somewhere. We highly doubt they would have survived this long out in a field unattended.
kerbiloid, posted August 13:

3 hours ago, Gargamel said: "somewhere"

I mean, to see what they are doing at those moments. Maybe it somehow depends on the lunar tides. High water: server is underwater. Low water: server is available.

***

https://en.wikipedia.org/wiki/On_the_Beach_(2000_film)

Quote: "The source of the automated digital broadcast is traced to a television station whose broadcast, Towers and his executive officer discover, comes from a solar-powered laptop trying to broadcast a documentary via satellite."

Maybe, like in the novel, the wind is moving the window back and forth, switching the server on and off.
Lisias, posted August 13:

On 8/12/2024 at 9:38 AM, ColdJ said: "2 really bad blackouts in the last 48 hours."

And one this morning, starting at 08:43 (GMT-3) and still ongoing.

I detected a pretty small improvement at 9:00 sharp, but it lasted less than 5 minutes. I noticed another small improvement at 09:59, but it lasted only 2 minutes.

So I stopped all my activities at 11:20 GMT-3, fearing I could be responsible for the problem somehow, and I don't have any more information from that point on. Unfortunately, nothing changed (had I been involved in the mess, at least there would have been something I could do).

Given the timestamps at which I noticed those small improvements, I'm guessing one of the triggers for the problem is some quota being exhausted, and my timings above strongly suggest this quota is renewed hourly.

I will publish a small report later, but it will not provide us with better information than what I already summarized above.
ColdJ (thread author), posted August 13:

@Lisias 4 bad blackouts in 3 days now. It took me 2 hours to get on the forums just now. I also notice that I am getting 2 different servers giving me 502 bad gateway, the second being Cloudflare.
Lisias, posted August 13 (edited):

1 hour ago, ColdJ said: "I also notice that I am getting 2 different servers giving me 502 bad gateway, the second being Cloudflare."

Strongly suggesting the problem is in something both of them use, such as a database. My current working theory is that the contracted database's hourly quota is burning up pretty quickly.

I think someone needs to reconfigure Cloudflare to limit the allowed number of requests per minute. I have word that it's currently 1500 hits per minute, and this appears to be too much. I would try setting it to 500 and, if it is still failing, 250 and later 125. I would not lower the value below that, as a lot of people share an IP due to Carrier-Grade NAT (or VPNs). From that point on, I think some other measure would be needed.

@Deddly, @Vanamonde, @Gargamel, is there any report you can access from the Forum Console where the accesses are logged? Of course, you can't share it with me due to privacy concerns, but if you can provide me with a sample entry (I can hit the forum once with curl from my company server's IP, and then you send me that exact log entry), I can cook up a small Python or Perl script, and then you can run the logs through it and detect the offenders.

1 hour ago, ColdJ said: "@Lisias 4 bad blackouts in 3 days now. It took me 2 hours to get on the forums just now."

An idea just occurred to me. I detected that wiki.kerbalspaceprogram.com resolves to the same IP for me. I will monitor the wiki too, at the same time as I monitor (somehow) the forum. This will help in zeroing in on the root cause.

=== == = ADDENDUM = == ==

As of 13:13 GMT-3, I'm not experiencing any HTTP 5xx anymore.

=== == = ADDENDUM² = == ==

It's 13:37 GMT-3, and everything is screwed again.

Edited August 13 by Lisias (Entertaining grammars made slightly less entertaining...)
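The offender-detection script proposed in the post above could look roughly like the following Python sketch. Everything about the log layout is an assumption: the Forum's real log format was never shared in the thread, so the sketch uses the widely used Common Log Format, and the name `offenders` and the example addresses are hypothetical. The default limit of 500 hits per minute is simply the first value suggested in the post.

```python
import re
from collections import Counter

# ASSUMPTION: Common Log Format lines, for example
#   203.0.113.7 - - [13/Aug/2024:08:43:12 -0300] "GET / HTTP/1.1" 502 512
# Group 1 is the client IP, group 2 the timestamp truncated to the minute.
LINE = re.compile(r"^(\S+) \S+ \S+ \[([^\]:]+:\d{2}:\d{2}):\d{2} [^\]]+\]")

def offenders(lines, limit=500):
    """Return {(ip, minute): hits} for each client above `limit` hits in one minute."""
    hits = Counter()
    for line in lines:
        match = LINE.match(line)
        if match:  # silently skip lines that do not fit the assumed format
            hits[match.groups()] += 1
    return {key: count for key, count in hits.items() if count > limit}

sample = [
    '203.0.113.7 - - [13/Aug/2024:08:43:%02d -0300] "GET / HTTP/1.1" 502 512' % s
    for s in range(3)
] + ['198.51.100.9 - - [13/Aug/2024:08:43:05 -0300] "GET / HTTP/1.1" 200 1024']
print(offenders(sample, limit=2))
```

Counting per client and per minute mirrors the per-minute limit discussed in the post; as the post notes, a single IP can stand for many people behind Carrier-Grade NAT or a VPN, so a hit on this list is a hint rather than proof.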
Lisias, posted August 14:

FYI

Report for the HTTP 5xx errors from last Sunday to today (until 16:20 Zulu).

This doesn't report all the occurrences, only the ones I was monitoring. I salvaged all the logs this time, and I will further expand the reports with my own hits.

I noticed some changes on Cloudflare yesterday and before. They caused some changes in the numbers, but didn't change the time at which the main batch of problems starts: more or less at 11:00 GMT-3 (14:00 Zulu).

I think some other changes happened again early today or late yesterday, as the last 12 hours of the graph below (from 0:00 to 13:20 GMT-3) show what appears to be an improvement.

Source: https://github.com/net-lisias-ksp/KSP-Forum-Preservation-Project/tree/master/torrent/reports
Kerbalsaurus, posted August 20:

They're starting to become less common for me, only happening between 12pm and 2pm EST.

Grenartia, posted August 21:

5 hours ago, Kerbalsaurus said: "They're starting to become less common for me, only happening between 12pm and 2pm EST."

Lately, I've been getting them around that time (in CDT), but also around 5-6pm CDT.

ColdJ (thread author), posted August 21:

Keeps happening at seemingly random times for me.

calabus2, posted August 21:

Noticed at 1:27 pm EST I was getting 502/503 errors as usual. After 10 minutes of retrying I gave up. At 1:52 the page was loading again.
Lisias, posted August 25 (edited):

FYI

Some graphs with the (perceived) Forum status in the last 7 days (GMT-3 timezone):

HTTP 429s are harmless: it's the Forum telling me that I'm hitting it too much, which triggers embargoes in the scraping tool. The nasty stuff is the 5xx ones.

Since I first implemented the 429 delays last weekend, the 5xx incidence dropped drastically, except between Aug 21st, 13:00 (GMT-3) and Aug 24th, 8:00. I don't have a clue about the reason.

Columns without any bars are timestamps without any occurrences other than HTTP 2xx or 3xx.

Since I implemented embargoes for these events (60 seconds minimum for a 429, 900 for a 5xx, with logarithmic backoff), these charts are not accurate: after receiving 2 or 3 consecutive 5xx responses, I may embargo scraping for up to 2 hours, and I don't know what happened in the meantime (the Forum may be working, or may be issuing new bad gateway events - I just don't know).

https://github.com/net-lisias-ksp/KSP-Forum-Preservation-Project/tree/master/torrent/reports

Edited August 28 by Lisias (Hit "Save" too soon)
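The embargo scheme described in the post above can be sketched in Python as follows. The minimum delays (60 seconds for a 429, 900 for a 5xx) and the 2-hour ceiling come from the post; the growth rule is an assumption, since the post only says "logarithmic backoff" without giving a formula. This sketch simply doubles the delay on each consecutive failure, and the name `embargo_seconds` is hypothetical rather than taken from the actual scraping tool.

```python
MAX_EMBARGO = 2 * 60 * 60  # "up to 2 hours", as stated in the post

def embargo_seconds(status, consecutive):
    """Seconds to pause after the `consecutive`-th failing response in a row."""
    if status == 429:
        base = 60       # minimum embargo for "Too Many Requests"
    elif 500 <= status <= 599:
        base = 900      # minimum embargo for any server-side error
    else:
        return 0        # 2xx/3xx and other responses: keep going
    # ASSUMPTION: the delay doubles on every consecutive failure, up to the cap.
    return min(base * 2 ** (max(consecutive, 1) - 1), MAX_EMBARGO)

for n in range(1, 6):
    print(n, embargo_seconds(502, n))
```

Under this assumed doubling rule, a run of 5xx responses reaches the 2-hour ceiling on the fourth consecutive failure, which is in the same ballpark as the "2 or 3" mentioned in the post but not an exact match.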
Lisias, posted August 27 (edited):

FYI

On 2024-08-27, from approximately 7:00 to 9:00 GMT-0, I'm having trouble accessing the Forum due to those pesky HTTP 5xx errors.

Personal scraping is not the problem - I'm voluntarily backpedaling once I receive a 429 or any 5xx, and yet the Forum is borking. If the same measures being applied to me are also being applied to everybody else, scraping is not the issue.

429s are harmless because I backpedal on every one I receive, but they hint that the Forum is getting hammered somehow - the others, the 5xx ones, are the problem.

The charts above make clear that personal scraping is not the cause of the outages. Whatever is being done to limit us is not working - which strongly suggests that the problem is being misdiagnosed and mishandled.

=== == = POST EDIT = == ===

Apparently things came back to normal at 9:30 Z.

Edited August 28 by Lisias (POST EDIT)
Lisias, posted August 29 (edited):

FYI

The borkage is still ongoing, and getting worse, at the same time as personal access to the Forum is being handicapped (perhaps in an attempt to solve the problem? If so, it's not working, sadly):

Whatever is happening:

It's not directly connected to the HTTP 429 events, as the 5xx continued to happen even after I started to monitor the former (on Aug 24th) and to self-restrict when detecting them.
It's getting worse over time, despite our best efforts to limit personal scraping.

Corollary: personal scraping was never the problem. Correlation does not imply causation!

From this point on, I don't plan to pester you guys further with such data. The problem is being neither solved nor mitigated, and it may have ended up hurting my own efforts - without noticeable benefits.

--- -- - POST EDIT - -- ---

Explanations for the less technically inclined:

HTTP 429 Too Many Requests is a message the server sends to us, the requesters, telling us that we are hitting it too much and that it would appreciate it if we backpedaled a bit. It's an HTTP 4xx series error message, meaning something was done wrong by the client and is to be fixed on the client side.

HTTP 5xx error messages are nasty errors that happened on the server side. Even when indirectly caused by user action, such an error is considered a server error because the server is the source of the problem. Examples of such errors we are facing lately:

HTTP 500 Internal Server Error: The request passed through the stack, but something inside the worker broke in an unexpected way (like an unhandled Exception) and the process crashed.

HTTP 502 Bad Gateway: Something broke when the frontend (nginx, Apache or IIS - the process that acts like a telephone operator, distributing the requests to the ones that will handle them) received the request and tried to pass it to whoever was responsible for really handling it (we call these backend processes), but there was nobody listening (usually, the worker process - daemon - had crashed, or sometimes was in the middle of a restart).

HTTP 503 Service Unavailable: Someone (or something) deactivated the backend, and the frontend tried to reach the worker nevertheless and (obviously) couldn't.

HTTP 504 Gateway Timeout: The frontend managed to send the request to the backend, but the backend didn't answer within a defined amount of time (the timeout). This usually means that something inside the backend is overloaded and taking too much time to finish its tasks, and by the time it finally does, there's no one waiting for the answer anymore on the other side. It's the worst kind, IMHO, because the timed-out request still incurs costs on the backend, and it also contributes to the problem happening again when the cause is quota exhaustion (because, well, it consumes quota without providing any results, prompting the caller to ask again, in a self-feeding vicious cycle).

=== == = POST EDIT2 = == ==

Updated the repo with data from this afternoon. We had more 5xx events.

Edited August 30 by Lisias (POST EDIT2 -- some corrections and addenda)
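The distinction drawn in the post above, between 4xx responses (the client's side to fix) and 5xx responses (the server's side to fix), can be illustrated with a short Python sketch built on the standard library's `http.HTTPStatus`. The function name `describe` and the "non-standard status" label are hypothetical. Codes that are not part of the standard registry, such as the 522 that shows up in the earlier WARC report, are not known to `HTTPStatus` and are handled separately.

```python
from http import HTTPStatus

def describe(code):
    """Return (side, phrase): who has to fix the problem, and the standard phrase."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "non-standard status"  # e.g. 522, a CDN-specific code
    if 400 <= code <= 499:
        side = "client"   # the requester did something wrong (or too often)
    elif 500 <= code <= 599:
        side = "server"   # the failure happened on the serving side
    else:
        side = "other"    # informational, success or redirect
    return side, phrase

for code in (429, 500, 502, 503, 504, 522):
    print(code, *describe(code))
```

A scraper can use the first element of the result to decide how to react: slow down on "client" responses such as 429, and record "server" responses as outages.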
Lisias, posted September 9:

Not wanting to beat a dead horse, but I found this on my feed today:

'This was essentially a two-week long DDoS attack': Game UI Database slowdown caused by relentless OpenAI scraping

https://www.gamedeveloper.com/business/-this-was-essentially-a-two-week-long-ddos-attack-game-ui-database-slowdown-caused-by-openai-scraping

Personal scrapers weren't the problem, as the Bad Gateways kept happening and happening even when the scrapers were imposing temporary moratoria on themselves (or plainly stopped scraping at all for some weeks).

The real cause was left unchecked.
ColdJ (thread author), posted September 10:

On 7/9/2024 at 9:50 PM, ColdJ said: "If, as I understand it, the company that hosts the KSP servers is used by over 40% of the web sites worldwide, and if I am right from stuff I have read that ChatGPT has access to the server, it may be that ChatGPT is inadvertently causing the problems by accessing the servers, looking for data, at a rate far greater than humans alone would be doing. And these high speed and numerous calls to the servers are overwhelming them. When they were first established they would not have been thinking of the servers being accessed as fast as these AI programs do."

I am surprised that @Lisias didn't tag @Vanamonde, @Deddly and @Gargamel to make sure they read the info on that link.

Quote from the article:

He explained the disruption eventually caused the website to stop working altogether, dishing out "502 Bad Gateway" errors to users. At that stage, Coates sought the help of Jay Peet, who hosted the database on their private server for the last five years.

Peet looked at the site logs and realized the website's resources were being swallowed by a single IP address belonging to OpenAI.

"The homepage was being reloaded 200 times a second, as the [OpenAI] bot was apparently struggling to find its way around the site and getting stuck in a continuous loop," added Coates. "This was essentially a two-week long DDoS attack in the form of a data heist."