
KSP Forums Archival Options.



13 hours ago, HB Stratos said:

Good to know. Unfortunately I can't think of an easy way to merge this with the WinHTTrack output, so I think I'll be stuck with just downloading everything.

This tool is not a simple thing either. I spent a lot of time just trying to set up the damned thing - but once you do the dirty work, it just works.

There are tools to build WARC files from dumped files, but you lose what's most important - the metadata that guarantees the data wasn't tampered with.
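Just to illustrate the point, this is roughly what rebuilding a record from a dump looks like with the warcio library (a sketch only, not what I'm running here - and note that the URL and the HTTP headers have to be made up, which is exactly the problem):

# Wrap an already-saved HTML dump into a WARC 'response' record.
# The original fetch headers and timestamps were never saved, so this
# only attests to the file as it exists now - not to what the server sent.
from io import BytesIO

from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

def wrap_dump(html_path, original_url, warc_path):
    with open(html_path, 'rb') as src, open(warc_path, 'wb') as out:
        writer = WARCWriter(out, gzip=True)
        # We have to invent plausible HTTP headers - the real ones are gone.
        http_headers = StatusAndHeaders(
            '200 OK',
            [('Content-Type', 'text/html; charset=utf-8')],
            protocol='HTTP/1.1',
        )
        record = writer.create_warc_record(
            original_url, 'response',
            payload=BytesIO(src.read()),
            http_headers=http_headers,
        )
        writer.write_record(record)

# Hypothetical file and URL, just to show the shape of the call:
wrap_dump('page001.html',
          'https://forum.kerbalspaceprogram.com/topic/0-hypothetical/',
          'rebuilt.warc.gz')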

One way or another, once I manage to get this thing up and running, I will share everything (somehow), so you can have it if you want.

 

9 hours ago, m4ti140 said:

Internet Archive isn't fully safe either, because they have been under assault from corporate lawyers recently.

Yes. Which is the reason I decided to "go rogue" and do things the hard way. :)

 

14 hours ago, HB Stratos said:

I've used https://github.com/jsvine/waybackpack with success before to download stuff from archive.org. I forgot whether I wrote a script to feed it hundreds of links or if it already has that capability, but it has done its job well where I needed it.

Good idea. The problem I have with archive.org is that I already detected that some pages are missing from the history, and every time I tried to add these pages to the crawler, I was greeted with an error message, to the point I started to think that TTI had issued a takedown on them. Good to know I was wrong, but I still have the missing-pages problem to cope with.

But, still, it's a good idea - they are not mutually exclusive solutions. As a matter of fact, in theory I can have the waybackpack WARCs indexed and served by pywb just the same. Once I finish installing this Kraken-damned tool (see below), I will pursue this avenue too.
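Feeding it the list of missing pages shouldn't need more than something like this (a sketch only, not a tested script - urls.txt, the output directory, and the 2-second pause are placeholders):

# Feed a list of URLs, one per line, to waybackpack so the snapshots land
# in a local directory that can later be indexed by pywb.
import subprocess
import time

with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    # -d selects the output directory; --raw asks for the snapshots without
    # the Wayback Machine's rewriting, which is what an archive wants.
    subprocess.run(['waybackpack', url, '-d', 'wayback-dump', '--raw'],
                   check=False)
    time.sleep(2)  # be gentle with archive.org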

 

=== == = NEWS FROM THE FRONT = == ===

The tool is working (finally), except for crawling. There were no instructions about how to deploy some browser-side dependencies, not to mention that I'm using Firefox, which has some JavaScript shenanigans that demanded some changes while deploying - so, yeah, once this thing is working, some pull requests will be made. :D

Right now, I'm cursing the clouds because a browser-side library was migrated to TypeScript, and I'm installing a Node.js (blergh) environment to compile the damned thing into JavaScript and then deploy it.

All this work will be available to whoever wants it. I will publish a package with batteries included to make the user's life easier - or at least less harsh; this tool is a "professional" archiver, and it's way less friendly than HTTrack, for example.

 


Awesome work Lisias, great to hear!

On my front things have been going somewhat okay. I messed up my first download to the point where it was too hard to recover the temporary download, but on the bright side I made all my mistakes on that one and was able to start a new download attempt with fixed settings. It's now downloading an order of magnitude faster: I'm managing to average 150 kb/s and have already recovered in 6 hours what took two days last time. It will be a bit more of a pain to parse in the end, as I'm writing DOS-compatible filenames instead of the full web path, but I'm hoping this will fix some cross-linking issues I was seeing. For now I just want all of the data to exist; formatting can be fixed even with the forums lost.


On 7/8/2024 at 5:18 AM, Superfluous J said:

I can't speak for the community, but as one of its members I'll speak for me.

Without this Forum I'll stop interacting with you all. I won't frequent Discord, Reddit, or any other site. I don't play KSP or KSP2 anymore and I don't generally visit sites or channels for games I don't play.

KSP is actually a huge exception from that rule because I've barely played for years yet am still here.

I agree that a traditional web forum is best for KSP and its many discussions, mods, and challenges.

There can be a lot of arguments made one way or another about why TTI will keep this forum or why TTI will shut this forum down.  I note that this is very similar to the discussion that happened earlier about KSP2--and we all know what happened there.

We can't guarantee this forum will continue.  But as long as this forum still exists, it complicates things for any new KSP Community forum (although if one is started, I'll make an account there).  And if this forum goes away without warning, we will want a KSP Community forum to replace it.

That means there's a bit of a dilemma.  I suggest the following as the solution.

If these forums are shut down without warning, I would suggest checking KSP-oriented Discords and the KSP subreddit for information on a KSP Community-driven replacement forum.  There should also be an informative banner set up at Space  There are people in the KSP community who will do that.  And we'll all move there.


Suggestion:

I know the various groups are waiting to see if the forums actually go down before making their version of the site go live. Why spend the money if it’s not needed, right?

I’m working under the assumption that the backups created do not have access to any critical user data that could be used for identity verification, and that these backups would be dead, archived threads that would be referenced but not interacted with.

If this is the case, and if any of the teams undertaking the task have the infrastructure available, I would suggest going live in early September or so. Not for moving traffic over there now, but to allow users to create accounts and verify they are who they claim to be by posting over here. Each new site would start its own thread here, and each user here would post a link to their new account on the new site.
 

This would prevent name grabbing on the new sites and allow for a smoother transition, if needed.
 

A couple of major downsides to this:
 

Launching a new site costs money out of pocket, and if this site doesn’t go down, that won’t be recovered. GoFundMe-type posts aren’t allowed here on the forums right now, but we can discuss internally with the moderator team whether some allowances could be made. But until stated explicitly, assume not.
 

The other is the fracturing of the community with competing sites. This is bad, very bad. Perhaps each site should only go partially live, allowing only “Hey, I’m here” type posts, until it becomes an obvious necessity. We stick with this one until we have to give the abandon-ship order, which may never come. After that, well, it’s Darwinism.
 

— This post is not made as a moderator, just as a concerned community member. I have not discussed any of this with the team, and nothing I’ve said should imply the rest of the team condones any of it. —


FYI, I am roughly 90% of the way through backing up the entire forum's data.

Not as basic HTML; I have written a scraper that grabs all the required metadata and stores it as JSON files (one JSON file per forum page, with separate metadata files).

I will start uploading the data within the next few days, and will also keep it up to date at the end of each day with the most recent forum posts.

I will more than likely be pushing it somewhere like GitHub, along with the code that goes with it, unless anybody has a better idea?


FWIW I already have at least one complete backup of the forum in static HTML form.

Working on a second now; I run them continuously at very low bandwidth. I would suggest scrapers bandwidth-limit their efforts as I do, for two reasons:

1.) We don't want to break the forums.

2.) The forum can and will tempban your scraper as a form of rate-limiting.  This is not immediately apparent, and will result in you scraping a bunch of HTML "Forbidden" pages.

The number of simultaneous connections seems to be able to go no higher than 1, for one example.

Frankly, I'd discourage further scraping; between the current efforts I'm sure it's covered, and we don't want to kill the forums.

That said, if you absolutely are going to do this anyway, these are HTTrack settings I have found that will work:

Near=0
Test=0
ParseAll=1
HTMLFirst=0
Cache=1
NoRecatch=0
Dos=0
Index=1
WordIndex=0
MailIndex=0
Log=1
RemoveTimeout=0
RemoveRateout=0
KeepAlive=1
FollowRobotsTxt=2
NoErrorPages=0
NoExternalPages=0
NoPwdInPages=0
NoQueryStrings=0
NoPurgeOldFiles=0
Cookies=1
CheckType=1
ParseJava=1
HTTP10=0
TolerantRequests=0
UpdateHack=1
URLHack=1
StoreAllInCache=0
LogType=0
UseHTTPProxyForFTP=1
Build=0
PrimaryScan=3
Travel=1
GlobalTravel=0
RewriteLinks=0
BuildString=%%h%%p/%%n%%q.%%t
Category=
MaxHtml=
MaxOther=
MaxAll=
MaxWait=
Sockets=1
Retry=3
MaxTime=
TimeOut=45
RateOut=
UserID=Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)
Footer=<!-- Mirrored from %%s%%s by HTTrack Website Copier/3.x [XR&CO'2014], %%s -->
AcceptLanguage=en, *
OtherHeaders=
DefaultReferer=
MaxRate=5242880
WildCardFilters=+*.css +*.js -ad.doubleclick.net/* -mime:application/foobar%0d%0a+kerbal-forum-uploads.s3.us-west-2.amazonaws.com/*%0d%0a-forum.kerbalspaceprogram.com/tags/*%0d%0a-forum.kerbalspaceprogram.com/forum/38-forum-games/*%0d%0a-forum.kerbalspaceprogram.com/discover/*
Proxy=
Port=
Depth=4
ExtDepth=
MaxConn=
MaxLinks=10000000000
MIMEDefsExt1=
MIMEDefsExt2=
MIMEDefsExt3=
MIMEDefsExt4=
MIMEDefsExt5=
MIMEDefsExt6=
MIMEDefsExt7=
MIMEDefsExt8=
MIMEDefsMime1=
MIMEDefsMime2=
MIMEDefsMime3=
MIMEDefsMime4=
MIMEDefsMime5=
MIMEDefsMime6=
MIMEDefsMime7=
MIMEDefsMime8=
CurrentUrl=
CurrentAction=0
CurrentURLList=

 

Edited by R-T-B

15 minutes ago, HebaruSan said:

This looks fairly useless. I checked a familiar thread, and it seems to have confused the title with one of the tags, and the actual title and text are nowhere to be found in the repo.

Could you at least tell me which one, so I can go in and fix the issue that is causing it? Just saying "it's bad" doesn't help anybody.


1 hour ago, bizzehdee said:

Fairly interesting approach. I will give this a peek during the night - you are essentially "competing" with WARC. :)

On my side, I will bite the bullet and stick with pywb, despite not being exactly happy with the multi-daemon solution they chose (external CDX indexer). The direct alternative, OpenWayback, is deprecated, and the Internet Archive tools are even less user-friendly.

I found an external crawler, by the way, that I can rely on to do the crawling instead of injecting wombat into the browser's JavaScript land. The single-binary solution had already gone through the window anyway...


3 hours ago, bizzehdee said:

Could you at least tell me which one, so I can go in and fix the issue that is causing it? Just saying "it's bad" doesn't help anybody.

On the contrary, it helps inform other users who might otherwise stop their own archives on the assumption that there's already a good one out there.

Here are the files that say ""TopicTitle":"transfer calculator"", which, no:

And the actual thread:


12 hours ago, HebaruSan said:

On the contrary, it helps inform other users who might otherwise stop their own archives on the assumption that there's already a good one out there.

Here are the files that say ""TopicTitle":"transfer calculator"", which, no:

Your report is incomplete and alarmist.

Yes, you found a problem. But you failed to report that other threads were fetched all right.

I agree that this is still a work in progress. I disagree that it's useless. It's just not ready yet.

My guess is that @bizzehdee's crawler is failing to detect when the response returns an empty page under an HTTP 200. I suggest checking whether the response is valid and, if not, sleeping for a few seconds and trying again.
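Something along these lines, for instance (just a sketch of the idea, not anybody's actual code - the 512-byte threshold and the back-off times are guesses):

# Treat an HTTP 200 with a (nearly) empty body as a soft failure:
# sleep, then try again, backing off a bit more on every attempt.
import time
import requests

def fetch_with_retry(url, tries=5, min_bytes=512):
    for attempt in range(tries):
        resp = requests.get(url, timeout=30)
        # Cloudflare sometimes answers 200 with an empty/placeholder page.
        if resp.status_code == 200 and len(resp.content) >= min_bytes:
            return resp.text
        time.sleep(5 * (attempt + 1))
    raise RuntimeError(f"Gave up on {url} after {tries} attempts")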

That's what I was doing, by the way, when I accidentally fired the crawler without auto-throttle and got a 1015 rate-limit from Cloudflare... :rolleyes: Oh, well... I will do some KSP coding in the meantime. :D

Which reminds me:

NEWS FROM THE FRONT

I gave the one-finger salute to pywb's wombat. I'm doing the crawling with Scrapy and a customized script to handle the idiosyncrasies, and pywb is now set up as a recording proxy - something it really excels at. The only drawback is the need to set up a Redis server for deduplication.
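Stripped to the bone, the arrangement looks more or less like this (a sketch only - the proxy port and the throttle numbers are illustrative, not my actual settings, and recording HTTPS through the proxy also means trusting pywb's own certificate on the client side):

# A Scrapy spider that sends every request through pywb running as a
# recording proxy; pywb writes each response to WARC, the spider only
# walks the topic and pagination links.
import scrapy

PYWB_PROXY = 'http://localhost:8080'  # pywb in recording/proxy mode

class ForumSpider(scrapy.Spider):
    name = 'ksp_forum'
    start_urls = ['https://forum.kerbalspaceprogram.com/']
    custom_settings = {
        'AUTOTHROTTLE_ENABLED': True,   # the thing I forgot to turn on...
        'CONCURRENT_REQUESTS': 1,
        'DOWNLOAD_DELAY': 2,
    }

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, meta={'proxy': PYWB_PROXY})

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            if '/topic/' in href:
                yield response.follow(href, callback=self.parse,
                                      meta={'proxy': PYWB_PROXY})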

On the bright side, the setup ended up not being too memory-hungry; I'm absolutely sure I will be able to set up a Raspberry Pi 4 (or even a 3) to do this job! :cool:

Setting up a public mirror, however, may need something more powerful (but I will try the RasPi all the same). For replaying, you need a dedicated CDX server in Java to be minimally responsive.

And, yes, the thing is creating WARC files like a champ. This solution is 100% interoperable with the Internet Archive and almost every other similar service I found. If I understood some of the most cryptic parts of the documentation correctly, we can federate each other's mirrors on pywb itself, saving us from NGINX and DNS black magic. ;)

Note to my future self: don't fight the documentation, go for the Source! :sticktongue:

=== == = POST EDIT = == ===

Oukey, the 1015 ban was lifted while I typed this post from my mobile. :)

Back to @bizzehdee: here follows a thread that was fetched correctly:

https://github.com/bizzehdee/kspforumdata/blob/main/topics/1000-meta.json

https://github.com/bizzehdee/kspforumdata/blob/main/topics/1000-1-articles.json

Again, the crawler needs some work to work around Cloudflare's idiosyncrasies (the HTTP 200 with an empty page being the most annoying), but the tool is working almost fine.

And this parsed data will be very, very nice for feeding a custom search engine!

=== == = POST EDIT² = == ===

I found an unexpected beneficial side effect of using a local, Python-based crawler - now it's feasible to distribute tasks!

Once we establish a circle of trusted collaborators, we can divide the task into chunks and distribute them among the participants. This will lower the load on the Forum, save bandwidth for each participant, and accelerate the results.
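Just to give an idea of what I mean (the topic-ID range and the participant names below are placeholders, not a real plan):

# Split a range of topic IDs into roughly equal chunks, one per trusted
# participant; each participant crawls only their own slice.
def split_work(first_topic, last_topic, participants):
    ids = list(range(first_topic, last_topic + 1))
    chunk = -(-len(ids) // len(participants))  # ceiling division
    return {name: ids[i * chunk:(i + 1) * chunk]
            for i, name in enumerate(participants)}

assignments = split_work(1, 230000, ['mirror-a', 'mirror-b', 'mirror-c'])
for name, topic_ids in assignments.items():
    print(name, 'takes topics', topic_ids[0], 'through', topic_ids[-1],
          f'({len(topic_ids)} of them)')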

As soon as I consolidate the changes and fixes I made during the week into this repo, I will pursue this idea.

Edited by Lisias
POST EDIT²

The issue affected a very, very small number of posts that had a tag prefix.

I'm rerunning the update script, which will not only top up the latest posts from the last 24 hours but will also fix the titles for this small number of posts.

But if I am stepping on people's toes, I'll gladly just throw all of this structured data into the bin and let everyone struggle along with just HTML dumps, which would be effectively useless for trying to recreate a forum.


5 hours ago, bizzehdee said:

But if I am stepping on people's toes, I'll gladly just throw all of this structured data into the bin and let everyone struggle along with just HTML dumps, which would be effectively useless for trying to recreate a forum.

Please don't! I'm planning to use your metadata to double-check what I'm doing - I don't wanna lose content due to some unexpected condition not handled by the stack!

Cross-checking is the key to guaranteeing that.

 

1 hour ago, bizzehdee said:

Thank you!


The directory structure is based on the leaf forum ID that the topic is in. So for 155998, it's in 34, as it is in "KSP1 Mod Releases", which is forum ID 34.

So the relationship is quite basic: the directory name is the ID of the forum the post came from...
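For anyone who wants to pull a single thread out of the repo, the layout maps onto raw-file URLs roughly like this (a sketch only - the JSON schema isn't documented beyond what you can see in the files, and 34/155998 are just the IDs mentioned above):

# Fetch the metadata and a page of posts for one topic straight from the
# GitHub repo, using the forum-ID directory layout described above.
import json
from urllib.request import urlopen

BASE = 'https://raw.githubusercontent.com/bizzehdee/kspforumdata/main/topics'

def topic_meta(forum_id, topic_id):
    with urlopen(f'{BASE}/{forum_id}/{topic_id}-meta.json') as resp:
        return json.load(resp)

def topic_page(forum_id, topic_id, page=1):
    # swap the page number for whichever page you need
    with urlopen(f'{BASE}/{forum_id}/{topic_id}-{page}-articles.json') as resp:
        return json.load(resp)

meta = topic_meta(34, 155998)   # forum 34 = "KSP1 Mod Releases"
print(meta.get('TopicTitle'))
page_one = topic_page(34, 155998, 1)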

Edited by bizzehdee

10 minutes ago, bizzehdee said:

The directory structure is based on the leaf forum ID that the topic is in. So for 155998, it's in 34, as it is in "KSP1 Mod Releases", which is forum ID 34.

So the relationship is quite basic: the directory name is the ID of the forum the post came from...

Oh. In that case, a new problem report: 155998 is missing completely now. The 34 directory only goes up to 105354 or so.

On second glance, that's GitHub limiting the listing. Never mind, I'll edit the URLs...

Edited by HebaruSan

2 minutes ago, HebaruSan said:

Oh. In that case, a new problem report: 155998 is missing completely now. The 34 directory only goes up to 105354 or so.

On second glance, that's GitHub limiting the listing. Never mind, I'll edit the URLs...

https://github.com/bizzehdee/kspforumdata/blob/main/topics/34/155998-meta.json

https://github.com/bizzehdee/kspforumdata/blob/main/topics/34/155998-1-articles.json (replace the -1- with whatever page number you need)

 

Any issues, let me know, or submit a ticket on GitHub. I'm trying to make it all as complete as possible.


On 7/13/2024 at 8:22 AM, Lisias said:

I found an unexpected beneficial side effect of using a local, Python-based crawler - now it's feasible to distribute tasks!

Once we establish a circle of trusted collaborators, we can divide the task into chunks and distribute them among the participants. This will lower the load on the Forum, save bandwidth for each participant, and accelerate the results.

As soon as I consolidate the changes and fixes I made during the week into this repo, I will pursue this idea.

Sounds great! I'd definitely be willing to help out. My own efforts are running into hardware and software issues, so I feel consolidating our efforts makes a whole lot of sense.

On 7/12/2024 at 7:55 AM, R-T-B said:

FWIW I already have at least one complete backup of the forum in static HTML form.

Working on a second now; I run them continuously at very low bandwidth. I would suggest scrapers bandwidth-limit their efforts as I do, for two reasons:

1.) We don't want to break the forums.

2.) The forum can and will tempban your scraper as a form of rate-limiting.  This is not immediately apparent, and will result in you scraping a bunch of HTML "Forbidden" pages.

The number of simultaneous connections seems to be able to go no higher than 1, for one example.

Frankly, I'd discourage further scraping; between the current efforts I'm sure it's covered, and we don't want to kill the forums.

That said, if you absolutely are going to do this anyway, these are HTTrack settings I have found that will work:

Could you perhaps make your current download available somewhere? I'd like to use it so I don't have to re-download a million pages that you already have, and to expand the download to include external images, etc.


1 hour ago, HB Stratos said:

Could you perhaps make your current download available somewhere? I'd like to use it so I don't have to re-download a million pages that you already have, and to expand the download to include external images, etc.

If you have a file host that can take ~10 GB, we can talk. I have a personal server I can initiate an FTP transfer from, too.

But there are some issues with the current site image I am working through (mostly the fact that some pages get blanked when the server goes down, as it often does; I am doing a second pass to re-retrieve those). If you check in with me, say, a week from now, I should have something flawless, and I'd be happy to share it if you can provide the means to receive it.

Edited by R-T-B

I have a home server that I can slap a big storage drive on, and a proxy server with a public IP that I can forward the connection to. So let me know whenever you have a solid DB (and please also share your final HTTrack settings) and I'll get a file share set up.


3 minutes ago, HB Stratos said:

I have a home server that I can slap a big storage drive on, and a proxy server with a public IP that I can forward the connection to. So let me know whenever you have a solid DB (and please also share your final HTTrack settings) and I'll get a file share set up.

It doesn't need to be too big, it's just ~10 GB, but that's way too big for any sensible public/free file host lol.

I'll ping you all here as soon as I have something.

Edited by R-T-B

6 minutes ago, HB Stratos said:

Mega.nz provides like 50 GB for free last time I checked, so it might also work.

I'll check in with that. RARing or 7-Zipping it may also bring the size down a fair bit, given it's mostly HTML hypertext.

