RayneCloud Posted July 9

My desire to be directly involved was made clear to Vanamonde, and I imagine the rest of the team, only because I thought I was the only one looking into it at the time. Knowing there are others, that work has already begun, and that things are moving forward with something that might be workable... brings me a lot of joy. There was no desire to hide anything from anyone; please grow up and stop seeing malice in everything.
guest10985 Posted July 9 (edited)

As of 11:35pm EST on July 8th 2024, the KSP forums have been backed up to the Wayback Machine. I think, I hope.

Edited July 9 by guest10985: haven't checked everything
Lisias Posted July 9 (edited)

3 hours ago, guest10985 said: As of 11:35pm EST on July 8th 2024, the KSP forums have been backed up to the Wayback Machine. I think, I hope.

Some pages are missing, just checked.

21 hours ago, Lisias said: There's the tool the Internet Archive uses: https://archive.org/developers/internetarchive/index.html Somewhat more complicated than httrack (which I had used for a decade, by the way), but it's how these guys do things. It's also worth mentioning that I found this too: https://archive.org/details/forum.kerbalspaceprogram.com_202305 Updated in May 2023 with this tool. I'm reading that documentation to see if we could use the 2023.5 dump as a baseline to save Forum's bandwidth. Additionally, I think it's a very good idea to have this dump shared between many people, again to save Forum's bandwidth - there's no need for every single user here to do the job by themselves, we can collaborate.

That Internet Archive tool doesn't work for what I intended (it is a CLI tool to query their database), but I found a fork of another Internet Archive tool that makes incremental updates if you give it a previous backup: https://pywb.readthedocs.io/en/latest/manual/configuring.html It does backups in the very same way the Archive does. I think this is going to save a lot of bandwidth in the long run - I still hope we are overreacting, but as I said before: hoping for the best, but expecting the worst.

Now I'm trying to find a way to allow people to salvage their private messages. This is going to be tricky, because I will need to provide some kind of application bundle (fortunately, Python Freeze is our friend) so people can just download the thing and run it after providing their credentials. The second challenge is managing to be transparent and auditable, to give people confidence that it would be safe to give the tool their credentials without nasty consequences. A message suggesting changing the password before and after using the tool would not hurt either.

@Vanamonde - is there any way to find out what would be the best time to run the tool? Is there any chart in the management console showing the hours in which the site has the smallest load during the day?

Edited July 9 by Lisias: Entertaining grammars made slightly less entertaining...
Deddly Posted July 9

18 minutes ago, Lisias said: any chart in the management console telling the hours in which the site has the smallest load during the day?

There is a statistics page, but it unfortunately doesn't properly break down activity according to time of day. However, in general, Saturday appears to be the quietest day.
Demcrew Posted July 9

@Lisias - as I've mentioned in another topic, I'm willing to help. I've got a decent rig I can use to host a new forum if needed. It is powerful enough, and has decent space as well. KSP and its community are something we need to protect and carry on the best way we can. Keep me posted if anything turns this way.
Izny Posted July 9

41 minutes ago, Demcrew said: @Lisias - as I've mentioned in another topic, I'm willing to help. I've got a decent rig I can use to host a new forum if needed. It is powerful enough, and has decent space as well. KSP and its community are something we need to protect and carry on the best way we can. Keep me posted if anything turns this way.

Don't think a locally hosted rig, single point of failure, bus factor 1, is a great idea to be honest.
Deddly Posted July 9

Yeah, it's a nice idea but a forum hosted on a single machine in someone's house would be susceptible to all sorts of security and stability risks, unfortunately.
Demcrew Posted July 9

Agreed that a locally hosted rig is a single point of failure. To be fair and honest, I'm just offering what I can. Also forgot to mention that I'm an IT professional - I've been hosting a variety of servers for a while now. This is just in case... you know this option exists.
R-T-B Posted July 9 (edited)

14 hours ago, PDCWolf said: This kind of mystery and subterfuge is exactly why people wanted to stay away from the KSP2 modding scene. All the more reason to have more than one backup and alternative.

lol it's not subterfuge it's just friggin' Discord discussions, and I'm the one attempting the backup if anyone is curious. It's going well thus far.

I too possess a personal server but I wouldn't dare use it for a public facing forum of this scale. You'd want a hosted solution in a real datacenter somewhere. My sketchy forest internet is too sketch.

Edited July 9 by R-T-B
Lisias Posted July 9 (edited)

1 hour ago, Izny said: Don't think a locally hosted rig, single point of failure, bus factor 1, is a great idea to be honest.

<n> (where n > 1) federated hosts registered on a central http server, which does a temporary redirect (http 307) in round robin to the registered mirrors. A daemon pings the known mirrors, removes them from the pool when they go offline, and puts them back when they come back online. I have done it in the past, really a piece of cake.

Again, we are not replacing the Forum, we are preserving a mirror of it in a Community effort for historical reference. We still have a point of failure, the central http server, but this one is incredibly easy to replace. And, yeah, any source code will be available under an OSI license. Anyone will be able to have their own "Federation", I'm not centralizing anything on me.

33 minutes ago, R-T-B said: I too possess a personal server but I wouldn't dare use it for a public facing forum of this scale. You'd want a hosted solution in a real datacenter somewhere. My sketchy forest internet is too sketch.

Well, what I have in mind is a static and stateless mirror for historical reference, distributed across many hosts run by volunteers, so if one goes down, there are many others to keep the thing online. Now, a full Forum replacement, this is another problem. First things first, however.

42 minutes ago, Demcrew said: Also forgot to mention that I'm an IT professional - been hosting variety of servers along the way and since a while now. This is just in case... you know this option exists

Great.

Edited July 9 by Lisias: nope. http 307 is the way, I was initially correct.
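The redirect scheme Lisias describes above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not his actual implementation: the `MirrorPool` class, its method names, and the `example` hostnames are all hypothetical, and the HTTP server and the pinging daemon are left out. It shows only the pool logic: hand out mirrors in round-robin order with a 307, and skip any mirror the health check has marked offline.

```python
class MirrorPool:
    """Round-robin pool of mirrors that skips hosts currently marked offline.

    Hypothetical sketch: names and structure are assumptions, not from the thread.
    """

    def __init__(self, mirrors):
        self._mirrors = [m.rstrip("/") for m in mirrors]
        self._online = set(self._mirrors)  # every mirror starts in the pool
        self._cursor = 0

    def mark(self, mirror, is_up):
        """Record the result of a health check (what the daemon would call)."""
        mirror = mirror.rstrip("/")
        if mirror not in self._mirrors:
            raise ValueError(f"unknown mirror: {mirror}")
        if is_up:
            self._online.add(mirror)
        else:
            self._online.discard(mirror)

    def next_redirect(self, path):
        """Return (status, location): 307 to the next online mirror, else 503."""
        for _ in range(len(self._mirrors)):
            mirror = self._mirrors[self._cursor]
            self._cursor = (self._cursor + 1) % len(self._mirrors)
            if mirror in self._online:
                return 307, mirror + path
        return 503, None  # no mirror is reachable right now


if __name__ == "__main__":
    pool = MirrorPool(["https://a.example", "https://b.example"])
    pool.mark("https://a.example", False)  # the daemon found mirror a offline
    print(pool.next_redirect("/topic/1"))
```

A central server would call `next_redirect` per request and send the result as the `Location` header; because the redirect is temporary, clients keep coming back to the central host, which is what lets the pool change over time.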
Izny Posted July 9

2 minutes ago, Lisias said: <n> (where n > 1) federated hosts registered on a central http server, that does a temporary redirect (http 302) in a round robin to the registered mirrors. A daemon pings the known mirrors and remove them from the pool when offline, and puts them back when back online. I had did it in the past, really piece of cake. Again, we are not replacing the Forum, we are preserving a mirror of it in a Community effort for historical reference. We have still a point of failure, the central http server, but this one is incredibly easy to replace. And, yeah, any source code will be available in a OSI license. Anyone will be able to have their own "Federation", I'm not centralizing anything on me. Well, what I have in mind is a static and stateless mirror for historical reference. Distributed by many hosts from voluntaries, so if one goes down, there're many others to keep the thing online. Now, a full Forum replacement, this is another problem. First things first, however. Great.

That changes everything, for the better.

Can you explain the round robin thing to the current host? It seems their traffic balancer points to non-existent servers at peak hours.
Lisias Posted July 9 (edited)

1 hour ago, Deddly said: Yeah, it's a nice idea but a forum hosted on a single machine in someone's house would be susceptible to all sorts of security and stability risks, unfortunately.

Security will be a problem. Granted.

6 minutes ago, Izny said: can you explain the round robin thing to the current host? it seems their traffic balancer points to non existing servers at peak hours

I may be wrong, but I think that TTI cut the number of servers for Forum. And Cloudflare is not configured to cache html content (which would cause some esoteric misbehaviours on a Forum), so if enough people access the Forum at the same time, something breaks (probably the database) and we get a bad gateway due to a timeout, or due to plain quota exhaustion.

Edited July 9 by Lisias: brute force post merge
PDCWolf Posted July 9 (edited)

2 hours ago, R-T-B said: lol it's not subterfuge it's just friggin' discord discussions, and I'm the one attempting the backup if anyone is curious. It's going well thus far. I too possess a personal server but I wouldn't dare use it for a public facing forum of this scale. You'd want a hosted solution in a real datacenter somewhere. My sketchy forest internet is too sketch.

You... might not be seeing the problem, and not just you from the looks of this thread. If you're doing anything with my posts in private (and yes, I'm using myself as an example but this applies to everyone), you're violating my intellectual property rights, which I only ever transferred to Take Two by posting here. And even though by posting here I gave them the right to public display, that's their right to publicly display my post, not yours.

This is exactly why I never talked about public facing forum alternatives or "backups". Not just because I have zero care for anything but the mods, but because that really is a huge legal issue if TakeTwo wants to make it one, or any of the posters for that matter.

It is subterfuge in the literal sense because you were hiding the truth that you were doing this, in some private Discord most people were definitely not invited to. What weight you give to that of course I don't mind, and it's not even super grave to me, but a cat is a cat.

If you all wanna do private backups... you shouldn't be confirming, publicizing, or sharing anything about them. If you plan to do a public facing mirror of the forum, you're setting up for having to take it down the moment T2 farts your way.

Edit: https://www.take2games.com/legal/en-US/ <- Here's where you can check, under UGC, what rights every user has, and what rights have been transferred to TT (and not anyone else).

Edited July 9 by PDCWolf
Gargamel Posted July 9 (edited)

If you are worried about people altering your posts, you can do yourself the favor of backing up your own posts, and then if an issue arises you can compare them to the Wayback Machine. Which, as it happens, is already scraping the forums for their content and archiving it. If there were any valid legal arguments to be made, they would have been made.

As for privacy, there is no personal or private account information available to members or the public at large in any public facing section of the forums. Any information that would be public facing was added by the users themselves with the intent of it being public facing (locations, website, etc. info as found on the profile page). And as such, since any group that is scraping the forums for archival has no access to anything that could validate an account (password, email, etc.), there is no way they could use this data for anything but archives.

If there is a new forum, it will not have ongoing threads available to post in. The old threads will be available for reference only. There would be no way to verify that each account on a new forum is the same user as the account of the same name on this forum. Account names would have to be new and unconnected to the old ones.

We also have an unwritten policy that drama that occurs outside the forums, say on Discord or Reddit, should remain on Discord and Reddit.

Now, if we could get back to the issue of having members try to coordinate this effort, that would be much appreciated.

Edited July 9 by Gargamel
PDCWolf Posted July 9

2 hours ago, Gargamel said: If there were any valid legal arguments to be made, they would have been made.

"No complaints yet" doesn't mean "no complaints possible", as the Internet Archive would've made clear with their recent loss of about half a million works due to rights disputes. That you'd turn that into hyperbole about "editing my posts" is completely unrelated, unlike the discussion of the rights of posts and the rights granted to T2, which is vital and completely on-topic to archival efforts. So I take it this is probably bait to delete my answer as off-topic and warn me again or whatever. But hey, I'll bite, make it short and give you the pleasure:

For posts made here, as per T2's UGC policy, I'm giving my information through T2's services for public display, granting them, not anyone who'd make public backups, the right to display them. Thus it is against my rights to scrape my posts here and host them somewhere else. Now, I'm completely worthless in the grand scheme of things, but addon makers might have a word to say, which conjures up a lot of other possible legal issues.

Lastly, I don't take your word as a warranty of anything Take Two might or might not do, as you yourselves have stated: you have no idea what's going on and barely any connection to T2, and you're all still here because the forum somehow still exists. If you don't think discussing the legal disputes that could arise from archival work is part of the discussion of archival work, all I can do is laugh, and screenshot my post for when it disappears.
R-T-B Posted July 9 (edited)

4 hours ago, Gargamel said: We also have an unwritten policy that drama that occurs outside the forums, say on discord or Reddit, should remain on discord and Reddit.

Yeah, and I can understand that. The only reason Discord was mentioned is because the idea was born there, spitballing what to do with other community members... it was not my intent to exclude anyone, which is precisely why I am being open about it now. Discord may be where the archival idea was born for some of us, but none of us really think it is a replacement for a traditional forum. We really do want forums to stay but well, if they don't, we want a plan B.

Edited July 9 by R-T-B
Lisias Posted July 9 (edited)

You have a point. I will not try to sweeten the pill on the subject, so I will just address the following point:

19 hours ago, PDCWolf said: If you all wanna do private backups... you shouldn't be confirming, publicizing, or sharing anything about them. If you plan to do a public facing mirror of the forum, you're setting up for having to take it down the moment T2 farts your way.

Yes, and providing tools and teaching people to do that is my goal. However, doing that indiscriminately will hurt Forum, prompting someone to take down the initiative - so I decided to go WARC on the thing, so we can share the archives between us, saving Forum some bucks in bandwidth. Additionally, since anyone can make their own archive and compare the results, this will keep people (including me) honest. There are legally binding terms published on this Forum, and any change to some of them would be considered fraud - having more people with the same data is a safety measure for everybody involved, as we can support each other in the case of a dispute.

I completely agree that plain mirroring the site is a bad idea. In order to have a chance to survive, the Archives must try their best to be plausibly considered fair use in a Court, not to mention gathering people to support our case, prompting TTI (or anyone that ends up buying the lemon IP) to weigh any earnings from taking the thing down against the drawbacks in P/R and decide it's in their best interest not to intervene in a destructive way. However, we need to help them to help us (willingly or not). So we need to address some elephants in the room (and, yeah, you are really right on the money):

Impersonation
Dude, this is absolutely a no-go. Under no circumstances can one republish Forum's data in a way that may lead people to believe that you are them. So you just can't publish a mirror of the thing ipsis litteris using a different URL.

Plagiarism
Ditto! If you change the content in an attempt to prevent the Impersonation, you are... well... changing the IP and publishing a derivative!!! This is piracy, simple as that.

Copyright
Our only hope of success is to rely on the Copyright loopholes that may allow us to legally do this stunt.

Given the above considerations, I concluded that going Internet Archive is the most viable solution. The Look and Feel makes absolutely sure you are not impersonating Forum or TTI, the content is preserved (preventing plagiarism), and since the Internet Archive managed to legally publish their archives, this is a precedent that we may use to do the same.

TTI will always have the right to file a DMCA against anyone publishing such an Archive, however. To tell you the truth, they can do it even against our personal sites about the franchise (see Nintendo). So let's talk about what would prompt them to do that:

1. Risk of losing control of the IP
2. Devaluation of the IP
3. Loss of revenue (direct or indirect)
4. Someone at TTI waking up in a bad mood in the morning

Going Internet Archive style mitigates Risk 1 and Risk 2 - as a matter of fact, having this content preserved in case of the worst may even salvage some of the IP's value, as invaluable content to reboot the Community will still be available to anyone owning the Franchise in the future - it's notorious that even Nintendo had to rely on "backup sites" to be able to publish some of the ROMs they themselves sold in cartridges in the past!

Risk 3 is something we don't have to worry about, as Forum doesn't generate direct revenue - and the indirect revenue we have covered by mitigating Risks 1 and 2.

About Risk 4, the only defense we have is P/R. They had a huge backlash over the KSP2 drama, and that hurts - right now I'm pretty sure there's someone there overseeing everything to prevent another one. Bad P/R costs them money, huge amounts of money. And they are in the game (pun not intended) for the money.

So, as long as we manage to help them to help us (willingly or not), we have a reasonable chance to pull off this stunt. (Ab)using Game Theory a bit, these are the possible outcomes (as long as we stick to the rules I'm trying to delineate):

We do the Archiving, the Forum survives: Content preserved.
We do the Archiving, the Forum dies: Content preserved.
We do not do the Archiving, the Forum survives: Content preserved.
We do not do the Archiving, the Forum dies: Content is lost.

Since our main (and only) goal is the survival of the Content (as nobody here is going to make any money, directly or indirectly, with it), where are the better chances of saving the Content? Well, doing the Archive ourselves. So the logical decision is doing the Archive.

But, by then, we risk being taken down by a DMCA, right? What are the possible outcomes?

Forum survives, TTI issues a takedown on the Archives: Content preserved.
Forum survives, TTI ignores the Archives: Content preserved.
Forum dies, TTI ignores the Archives: Content preserved.
Forum dies, TTI issues a takedown on the Archives: Content is lost.

Again, since our goal is the preservation of the Content, it's our best move to do the Archiving all the same. What does it matter if TTI takes them down in the future, as long as Forum is alive? And if Forum ever dies, it's still in their best interest to preserve the content, as any future reboot of the franchise would benefit from it. Heck, I would not be surprised if someone at TTI ends up making a copy of our Archives for themselves.

=== == = POST EDIT = == ===

==== DISCLAIMER ====

We are not a cheap workforce. We are not doing it for them. We are doing it for ourselves in a way that would also benefit them; we are working to achieve a win-win situation. Yes, they royally screwed the pooch and there's a chance we would be helping to save their sorry arses. But we are doing it to save our own - saving theirs is a compromise to enhance our chances. Stone Soup.

Edited July 10 by Lisias: Entertaining grammars made slightly less entertaining...
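The two outcome lists in Lisias's post above share one rule, which a few lines of Python can make explicit. This is just an illustrative enumeration of the argument as stated in the post, with hypothetical function names: the Content is lost only when the Forum dies and no archive remains available (because none was made, or because it was taken down).

```python
from itertools import product


def content_preserved(forum_survives, archive_available):
    """The Content survives if either the Forum or an archive is still around."""
    return forum_survives or archive_available


def outcome_table():
    """Enumerate all four combinations, as in the lists from the post."""
    return {
        (forum, archive): content_preserved(forum, archive)
        for forum, archive in product((True, False), repeat=2)
    }


if __name__ == "__main__":
    for (forum, archive), preserved in outcome_table().items():
        print(
            f"forum survives={forum!s:5}  archive available={archive!s:5}  "
            f"-> {'preserved' if preserved else 'lost'}"
        )
```

Read this way, both lists are the same table: in the first the archive is unavailable because it was never made, in the second because it was taken down.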
Fizzlebop Smith Posted July 9

I agree that the spirit of what has been proposed thus far bolsters the community. Having a backup of all the specific content related to modding is paramount. Those games that have a 10-20 year following seem (in my meager experience) to be centered around vast libraries of custom content. It seems the loss of this information in a centralized location would certainly devalue the IP. I am very grateful to those with the technical expertise to see this through.
RKunze Posted July 9

11 hours ago, Lisias said: <n> (where n > 1) federated hosts registered on a central http server, that does a temporary redirect (http 307) in a round robin to the registered mirrors. A daemon pings the known mirrors and remove them from the pool when offline, and puts them back when back [ ... snip ... ] We have still a point of failure, the central http server, but this one is incredibly easy to replace.

Sort the federated hosts into "stable" (up with the same IP for longer than a reasonable DNS TTL for the domain name) and "unstable" (everything else). Have the stable ones play redirectors for the unstable ones (as well as serving their own share of contents), and put A/AAAA records for the m (1<=m<=n, choose to fit your redundancy needs) most stable ones under the main domain name into DNS. That should also give you a bit of load balancing on the DNS level, since multiple A/AAAA records for the same name are usually used in round-robin fashion by clients.

This can even be automated if you have a DNS provider that allows zone transfers (or zone updates over some other API), and you don't even need to worry about changing the DNS glue (just put the DNS provider's servers as glue into the parent zone and update those from a subset of the federated servers acting as "hidden primaries").

Been there, done that, works pretty well.
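RKunze's selection rule can be sketched as below. It is a rough illustration under stated assumptions, not a description of any real deployment: the `Mirror` record, the `plan_federation` function, the hostnames, and the idea of tracking "seconds on the same IP" per host are all hypothetical. The sketch only covers the sorting step: hosts whose IP has been stable for longer than the DNS TTL count as stable, the m most stable of those get A/AAAA records, and the unstable hosts are reached through redirects from the stable ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mirror:
    """Hypothetical record the federation would keep for each host."""
    name: str
    seconds_on_same_ip: int


def plan_federation(mirrors, dns_ttl_seconds, m):
    """Split hosts into stable and unstable and pick the m most stable for DNS."""
    if m < 1:
        raise ValueError("m must be at least 1")
    stable = sorted(
        (host for host in mirrors if host.seconds_on_same_ip > dns_ttl_seconds),
        key=lambda host: host.seconds_on_same_ip,
        reverse=True,  # longest-lived IP first
    )
    unstable = [host for host in mirrors if host.seconds_on_same_ip <= dns_ttl_seconds]
    return {
        "dns_records": stable[:m],     # published as A/AAAA under the main name
        "redirectors": stable,         # serve content and redirect to unstable hosts
        "redirect_targets": unstable,  # only reached through a redirect
    }


if __name__ == "__main__":
    hosts = [
        Mirror("alpha.example", 90 * 86400),
        Mirror("bravo.example", 600),
        Mirror("charlie.example", 7 * 86400),
    ]
    plan = plan_federation(hosts, dns_ttl_seconds=3600, m=1)
    print([host.name for host in plan["dns_records"]])
```

Pushing the chosen records to a DNS provider is deliberately left out, since that depends entirely on the provider's zone update mechanism.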
HB Stratos Posted July 10 (edited)

Hi everyone, I just found this thread. I wasn't aware it existed until now. Here's what I've been working on since Monday:

I parsed the KSP Forums sitemap.php with a Python script and compiled a txt file of ~78,000 links existing on the forum. (I'd happily share the Python code or list if needed.) I then fed that long list of links into WinHTTrack and it has been working away ever since, as far as possible. I had to take some breaks to fix configuration issues like a too low max link count or a too high external depth, but I'm making progress. At the moment I have ~89,000 pages under the forum.kerbalspaceprogram.com domain saved, along with a good bit of external stuff like imgur image links, though httrack is configured to grab html files first.

I could really use some help in the cleanup of the data after it is all done and downloaded. Download will take a hot minute though, I can't pull more than ~10-30KB/s from the forums without effectively DoSing them.

Also let me know if there's anything else I can do.

Edited July 10 by HB Stratos
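HB Stratos did not post the script, so the following is only a guess at what the link-extraction step could look like, not the actual code. It assumes the sitemap follows the standard sitemaps.org XML layout (`<loc>` elements inside `<url>` or `<sitemap>` entries); the `extract_links` name and the sample URLs are hypothetical, and the sample is inlined so that nothing is fetched from the Forum.

```python
import xml.etree.ElementTree as ET


def extract_links(sitemap_xml):
    """Return the text of every <loc> element, ignoring XML namespaces."""
    root = ET.fromstring(sitemap_xml)
    links = []
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]  # drop the "{namespace}" prefix
        if local_name == "loc" and element.text:
            links.append(element.text.strip())
    return links


# Hypothetical sample in the sitemaps.org format; the URLs are made up.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://forum.kerbalspaceprogram.com/topic/1-example/</loc></url>
  <url><loc>https://forum.kerbalspaceprogram.com/topic/2-example/</loc></url>
</urlset>"""

if __name__ == "__main__":
    # One link per line is the kind of txt list a crawler can take as input.
    print("\n".join(extract_links(SAMPLE)))
```

If the top-level sitemap is an index pointing at further sitemap files, the same function returns those file URLs, and each one would need to be fetched and parsed in turn.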
Lisias Posted July 10 (edited)

7 hours ago, HB Stratos said: Hi everyone, I just found this thread. I wasn't aware it existed until now.

Please give a peek at this post and this other one.

7 hours ago, HB Stratos said: I could really use some help in the cleanup of the data after it is all done and downloaded. Download will take a hot minute though, I can't pull more than ~10-30KB/s from the forums without effectively DOSing them. Also let me know if there's anything else I can do.

Oh, and at this one too! There's a WARC "dump" up to 2023-05 already on WebArchive. It's about 8G of packed data, but it's already fetched from Forum and, so, there's no need to fetch it again. Currently, I'm working on downloading that thing and then I will create a complementary WARC over the 2023-05 one, and then I will see how to release these data files into the wild (probably by a torrent).

With all these data files on hand, we will be able to do some interesting things - but please read my post above where I discuss the legalities.

=== BRUTE FORCE POST MERGE ===

To anyone willing to download the Internet Archive data: this dataset doesn't have a torrent, unfortunately. So I made this little script to download that huge basket of bytes using wget, with the option to resume the downloads if things go south in the process. Worst case scenario, you run the script again, no data loss.

#!/usr/bin/env bash
# Download every file of the 2023-05 dump; --continue resumes partial downloads.
for f in \
    forum.kerbalspaceprogram.com-00000.warc.gz \
    forum.kerbalspaceprogram.com-00000.warc.os.cdx.gz \
    forum.kerbalspaceprogram.com-00001.warc.gz \
    forum.kerbalspaceprogram.com-00001.warc.os.cdx.gz \
    forum.kerbalspaceprogram.com-meta.warc.gz \
    forum.kerbalspaceprogram.com-meta.warc.os.cdx.gz \
    forum.kerbalspaceprogram.com_202305.cdx.gz \
    forum.kerbalspaceprogram.com_202305.cdx.idx \
    forum.kerbalspaceprogram.com_202305_files.xml \
    forum.kerbalspaceprogram.com_202305_meta.sqlite \
    forum.kerbalspaceprogram.com_202305_meta.xml
do
    wget --continue "https://archive.org/download/forum.kerbalspaceprogram.com_202305/${f}"
done

Edited July 10 by Lisias: brute force post merge
xD-FireStriker Posted July 10

On 7/8/2024 at 11:35 PM, Lisias said: I will check this by night (GMT-3). Keep in mind that that is a dump from the Browser point of view, we just don't have access to the Forum inner guts, so no private messages or anything that demands being logged on to see.

I expected as much, no DMs will be saved, but I have seen the Internet Archive correctly archive one page but fail spectacularly at another.

On 7/9/2024 at 8:19 AM, Lisias said: The hard part will be to expect the worst without causing it - like a self-fulfilling prophecy.

Yeah, totally agree, I don't wanna move platforms if we don't have to.
Lisias Posted July 10

3 hours ago, xD-FireStriker said: I expected as much, no dms will be saved but i have seen the internet archive correctly archive 1 page but fail spectacularly at another.

Finally answering your original question, yes. That material is hot. However...

Only the application/http documents are saved in the WARC files. I would like to have the images hosted on Forum archived too, but whatever. This is easily fixable with the tool I chose, pywb.

But there's a catch - the pywb tool apparently doesn't agree with the Internet Archive about how to calculate the digests, and so I ended up wasting some time redownloading the damned thing, thinking that the download was corrupted somehow (the dumb-ass typing this post only thought of using gzip --test after redownloading the freaking gzipballs). I'm currently reindexing the archive to see if it will ignore the digest, or if I will need to go through about ~65 GB of http dumps to fix them myself - Kraken save BTRFS with compression activated - it's saving a lot of I/O here.

I will come back to you as soon as I manage to import this data into my current pywb collection.
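The `gzip --test` check Lisias mentions is the quickest way to rule out a corrupted download before blaming the digests. For anyone scripting this, a rough Python equivalent is sketched below, using only the standard library; the `gzip_is_intact` name is hypothetical. It streams the file through the decompressor and treats any decoding error as a failed check, which also covers truncated downloads.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1024 * 1024):
    """Decompress the whole file in chunks; return False on any gzip error.

    Note that a missing or unreadable file also returns False, since that
    surfaces as an OSError as well.
    """
    try:
        with gzip.open(path, "rb") as stream:
            while stream.read(chunk_size):
                pass  # discard the data, only the integrity check matters
    except (OSError, EOFError, zlib.error):
        return False
    return True


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        print(name, "OK" if gzip_is_intact(name) else "CORRUPTED")
```

Reading in chunks keeps memory use flat even for multi-gigabyte `.warc.gz` files, and Python's `gzip` module reads files made of several concatenated gzip members, which is how `.warc.gz` files are typically laid out.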
HB Stratos Posted July 10

5 hours ago, Lisias said: To anyone willing to download the Internet Archive data, this dataset doesn't have a torrent, unfortunately. So I made this little script to download that huge basket of bytes using wget with the option to recover the downloads if things goes south in the process. Worst case scenario, you run the script again, no data loss.

#!/usr/bin/env bash
# Download every file of the 2023-05 dump; --continue resumes partial downloads.
for f in \
    forum.kerbalspaceprogram.com-00000.warc.gz \
    forum.kerbalspaceprogram.com-00000.warc.os.cdx.gz \
    forum.kerbalspaceprogram.com-00001.warc.gz \
    forum.kerbalspaceprogram.com-00001.warc.os.cdx.gz \
    forum.kerbalspaceprogram.com-meta.warc.gz \
    forum.kerbalspaceprogram.com-meta.warc.os.cdx.gz \
    forum.kerbalspaceprogram.com_202305.cdx.gz \
    forum.kerbalspaceprogram.com_202305.cdx.idx \
    forum.kerbalspaceprogram.com_202305_files.xml \
    forum.kerbalspaceprogram.com_202305_meta.sqlite \
    forum.kerbalspaceprogram.com_202305_meta.xml
do
    wget --continue "https://archive.org/download/forum.kerbalspaceprogram.com_202305/${f}"
done

I've used https://github.com/jsvine/waybackpack with success before to download stuff from archive.org. I forgot if I wrote a script to feed it hundreds of links or if it already has that capability, but it has done its job well where I needed it.

5 hours ago, Lisias said: Please give a peek on this post and this other one.

Already saw both of them. I'm aware they may file a takedown, which sucks, but there's nothing illegal about downloading a website for private storage. So I'll continue to do this either way, so I have the peace of mind of having the data no matter what might happen.

5 hours ago, Lisias said: Oh, and on this one too! There's a WARC "dump" up to 2023-05 already on WebArchive. It's about 8G of packed data, but it's already fetched from Forum and, so, there's no need to fetch them again. Currently, I'm working on downloading that thing and then I will create a complementary WARC over the 2023-05 one, and then I will see how to feed these data files on the wild (probably by a torrent). With all these data files on hand, we will be able to do some interesting things - but please read my post above where I discuss about legalities.

Good to know. Unfortunately I can't think of an easy way to merge this with the WinHTTrack output, so I think I'll be stuck with just downloading everything.
m4ti140 Posted July 10 (edited)

On 7/8/2024 at 11:41 AM, Lisias said: There's the tool the Internet Archive uses: https://archive.org/developers/internetarchive/index.html Somewhat more complicated than httrack (what I had used for a decade, by the way), but it's how these guys do things. It also worths to mention that I found this too: https://archive.org/details/forum.kerbalspaceprogram.com_202305 Updated at May 2023 with this tool. I'm reading that documentation to see if we could use the 2023.5 dump as a baseline to save Forum's bandwidth. Additionally, I think it's a very good idea to have this dump shared between many people, again to save Forum's bandwidth - there's no need to every single user here to do the job by themself, we can collaborate.

The Internet Archive isn't fully safe either, because they have been under assault by corporate lawyers recently. They had to take down 500k books that were all lent one copy at a time, rather than distributed for free download. It's not out of the question that they could go down completely one day. Someone needs to store the dump locally in case it ever needs to be hosted elsewhere.

EDIT: nvm, I see that this was the intention all along, disregard.

Edited July 10 by m4ti140