Lisias Posted July 30 (edited) On 7/30/2024 at 12:50 AM, Ultimate Steve said: Firstly, thank you for all of your preservation efforts! Thx! On 7/30/2024 at 12:50 AM, Ultimate Steve said: This isn't strictly preservation related, but you're probably a lot more well versed in the legalese than I am - when you say "images and styles", does that refer to the background and banners of the forum, or to pictures in posts? Yes! On 7/30/2024 at 12:50 AM, Ultimate Steve said: And you're saying, for sure, anything posted here belongs to the poster Nope. The Legislation is saying it - I'm just the messenger! First, we need to understand the difference between "authorship" and "ownership". They are two different things. Ownership is plain straightforward: it's the exact same concept as owning a videogame console or a car. You have a set of rights over a property by owning it. Authorship is a bit less straightforward, but in essence it is "who wrote the damned thing in the first place". You also have a (different) set of rights over an intellectual property by authoring it. In the USA these two concepts are somewhat blurred, because under USA legislation it's possible to "transfer" the authorship of an IP to someone else - ghost writers, for example. In some other countries (like mine), authorship is nontransferable, period. So a ghost writer will forever be the author of the book, and nobody can legally take that away from him. But the ghost writer can still sell all the ownership rights to whoever is paying him - so for everything that matters from the commercial point of view, it works the same. About ownership: you automatically own anything you author by default; additional terms of service or contracts are needed to transfer such rights. That's the reason one needs to agree with Forum's ToS to post here - otherwise Forum would be subject to the whims of copyright trolls around the World. And what does the ToS say? Quote 5.2 Rights to UGC.
You retain whatever rights, if any, you may have under applicable law in Your UGC. If you do hold any such rights to Your UGC, including any copyright or other intellectual property interest, then, in exchange for the rights licensed to you in this Agreement, you hereby grant us an irrevocable, worldwide, royalty-free, non-exclusive and sublicensable right to use, reproduce, edit, modify, adapt, create derivative works based on, publish, distribute, transmit, publicly display, communicate to the public, publicly perform, and otherwise exploit Your UGC within or via the Services or for any other commercial and non-commercial purpose related to the Services, including but not limited to the improvement of the Services, without compensation or notice, for the full duration of the intellectual property rights pertaining to Your UGC (including all revivals, reversions, and extensions of those rights). Without limiting the foregoing, the rights licensed to Take-Two herein explicitly include the right for Take-Two to allow other users to use Your UGC as part of our operation of the Services. By creating, uploading, or distributing Your UGC to or via the Services, you represent to us that you own any rights in and to Your UGC on a sole and unencumbered basis, and that any such rights you grant to us in this Section, and our exploitation of those rights, will not violate or infringe the rights of any third parties. So: UGC is "user generated content" - your posts. "You retain whatever rights, if any, you may have under applicable law in Your UGC." Forum is not claiming ownership of your UGC. PERIOD. Forum never has since I've been here, and I've been here since 2018. "you hereby grant us an irrevocable, worldwide, royalty-free, non-exclusive and sublicensable right to" (yada yada yada) So, okay, they can use your UGC the way they want, they may make some money from it, and they don't owe you a penny. But that's all - you are still the author and the owner of your content!
And that's it. All your posts here can be used by TTWO the way they want as a licensee, because you granted such a license to them by posting things here. But this license, besides being irrevocable, is also non-exclusive, i.e., they are not claiming that only they have the right to do such things with your posts - you are entitled to give such rights to anyone else if you want, but you can't revoke the rights you already granted to Forum. And it's as simple as that, once you decode all that legalese into plain English. On 7/30/2024 at 12:50 AM, Ultimate Steve said: I was under the impression that T2 legally owned any content posted here for some reason (probably the panic back when the old EULA change took effect), if this is not in fact the case, and I actually own all the stories and images that were used in the stories that I've made, this is a really big weight off my shoulders. All that drama was FUD. There were reasons to criticize TTWO at that time (and now we know it pretty well), but this one - definitively - wasn't one of them, and I dare say that it helped to sweep the real troubles under the carpet. Edited July 31 by Lisias Entertaining grammars made slightly less entertaining...
HB Stratos Posted July 31 (edited) On 7/29/2024 at 9:01 PM, Lisias said: But this link also returns a "This content is no longer available" error message the same, because IA had revisited this link after the link had expired. There's no previous visit for this link, so that specific content is lost for good now. That is why my script uses /0if_/, where the 0 stands in place of the usual timestamp of the archive you are requesting from archive.org. It always jumps to the nearest available one, so putting a 0 there makes it always select the first existing archive of that link, which is usually the working one. But yes, if the @match in the script is extended to cover whatever the new website domain would become, it would continue to work there, as it would anywhere else on the internet it is enabled. EDIT: never mind, I somehow missed the part where you wrote there was no prior archive. Yep, that is completely gone then. My script won't break on that; it'll just log yet another 404 in the console. Edited July 31 by HB Stratos
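The /0if_/ trick described above can be condensed into a tiny helper. This is only an illustrative sketch (the function name is mine, not part of HB Stratos' script): the `0` stands in for a real 14-digit timestamp and resolves to the nearest, i.e. earliest, capture, while the `if_` flag asks the Wayback Machine to serve the capture without the archive.org toolbar.

```python
def wayback_earliest(url: str) -> str:
    """Build a Wayback Machine URL for `url` using the /0if_/ trick:
    '0' stands in for a real 14-digit timestamp (the Wayback Machine
    resolves it to the earliest available capture), and 'if_' asks for
    the raw capture without the archive.org banner."""
    return f"https://web.archive.org/web/0if_/{url}"

# e.g. wayback_earliest("https://forum.kerbalspaceprogram.com/")
#  -> "https://web.archive.org/web/0if_/https://forum.kerbalspaceprogram.com/"
```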
Lisias Posted August 1 Fun Facts: The very first publicly available post is dated 2011-06-18 23:49. @HarvesteR sleepless that day, uh? Forum reached 1,000 posts at 2011-07-24 22:03. Forum reached 10,000 posts at 2012-04-28 18:33. Forum reached 100,000 posts at 2015-02-18 09:17. Forum reached 200,000 posts at 2021-02-01 13:21. Forum took about a month to reach 1,000 posts, 10 months to reach 10K posts, and a bit less than 4 years to reach 100K posts. Then it took another 6 years to reach 200K posts - slightly slower, but not by much... This kinda matches the Steam Charts: this Forum was pretty crowded up to 2015 or so, more or less matching the concurrent players on Steam Charts. You see, we will be able to do some interesting stats using data from the Forum - having all that data offline on your hard disk allows us to munch it way faster, and without hitting Forum all the time. NEWS FROM THE FRONT I came to terms with the fact that a collaborative, distributed effort to scrape the Forum in a decentralized way would degrade relatively fast into everybody scraping everything, and then someone deduplicating the WARC files offline. The reason is that Forum doesn't exactly favour our efforts, due to the way it handles timestamps on the posts! Instead of a full timestamp, Forum uses relative time deltas (seconds ago, minutes ago, days ago, etc.), and this changes every second - changing the page contents, changing the digest, and making it pretty hard to deduplicate them. At least until the page/comment reaches an age at which Forum decides to use full timestamps. For people who love gory details, this is the html tag from the previous post above: <time datetime="2024-07-31T21:59:13Z" title="07/31/2024 06:59 PM" data-short="16 hr">16 hours ago</time> This means that this very page will render differently in an hour, even if I had not added this post now.
So the digest (on redis) deduplication ends up being pretty useless at this time for anything newer than thread 222455, as anything newer will have those pesky time deltas - exactly where the work will focus - defeating the purpose of the idea in the first place. So the idea is degenerating from a Federated approach into a classic Master/Slaves one, which means that someone will need to host that Master, and that some kind of authentication method will be needed to defend against adversarial slaves pretending to help... Trying to find my way on it.
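For reference, the digest-based deduplication being defeated here boils down to something like this (a sketch only: an in-memory set stands in for the project's redis store, and the helper names are mine):

```python
import hashlib

seen_digests = set()  # stands in for the redis store in this sketch

def should_archive(payload: bytes) -> bool:
    """Archive a page only if its content digest was never seen before.
    Relative timestamps ("16 hours ago") mutate the payload every hour,
    changing the digest and defeating exactly this check."""
    digest = hashlib.sha1(payload).hexdigest()
    if digest in seen_digests:
        return False  # unchanged content: record a cheap revisit instead
    seen_digests.add(digest)
    return True
```

In WARC terms, a `False` here is where a scraper would emit a small revisit record pointing at the earlier capture instead of storing the full response again.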
RKunze Posted August 1 4 hours ago, Lisias said: For people who love gory details, this is the html tag from the previous post above: <time datetime="2024-07-31T21:59:13Z" title="07/31/2024 06:59 PM" data-short="16 hr">16 hours ago</time> This means that this very page will render differently in an hour, even if I had not added this post now. If you don't mind losing the relative timestamps, you could pre-parse the page on scraping: strip out "data-short", replace the "16 hours ago" text with a formatted version of "datetime" (e.g. "July 17 2024 21:59 UTC"), and feed that edited version to the deduplication code. That should give you identical hashes even for new pages, and won't lose any relevant info for archiving.
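RKunze's suggestion could look roughly like this (an illustration, not the project's code; the regex and function name are mine): every `<time>` element is rewritten to its stable `datetime` value before hashing, so two captures of an unchanged page taken hours apart hash identically.

```python
import hashlib
import re

TIME_TAG = re.compile(r'<time datetime="([^"]+)"[^>]*>[^<]*</time>')

def normalized_digest(html: str) -> str:
    # Collapse the volatile parts (data-short and the relative
    # "16 hours ago" text) down to the stable absolute datetime
    # before hashing, as suggested above.
    stable = TIME_TAG.sub(r'<time datetime="\1">\1</time>', html)
    return hashlib.sha1(stable.encode("utf-8")).hexdigest()

# The same post captured an hour apart; only the relative parts differ:
earlier = '<time datetime="2024-07-31T21:59:13Z" title="07/31/2024 06:59 PM" data-short="16 hr">16 hours ago</time>'
later = '<time datetime="2024-07-31T21:59:13Z" title="07/31/2024 06:59 PM" data-short="17 hr">17 hours ago</time>'
```

The normalized form is fed only to the deduplication code; what gets written to disk is still the untouched bitstream.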
Lisias Posted August 1 1 hour ago, RKunze said: If you don't mind losing the relative timestamps, you could pre-parse the page on scraping, strip out "data-short", replace the "16 hours ago" text with a formatted version of "datetime" (e.g. "July 17 2024 21:59 UTC") and feed that edited version to the deduplication code. That should give you identical hashes even for new pages, and won't lose any relevant info for archiving. But that would mangle the bitstream, breaking one of the premises of the project - absolutely no derivatives: whatever I write into the WARC file must be exactly what Forum sent me. I'm currently working on some pywb extensions that will help me with this task. Your idea is very good, to tell the truth - I'm wondering if I can use it somehow in the extensions I'm writing now. At the very worst, I think your idea would do beautifully as an output filter on the pywb replay! https://github.com/net-lisias-ksp/KSP-Forum-Preservation-Project/issues/8
Lisias Posted August 4 (edited) NEWS FROM THE FRONT It's housekeeping time. I found a way to minimize unnecessary accesses to Forum, as well as to better containerize the assets. I also created some add'ons for pywb. Check the project's repository for the gory details, especially the closed issues. However... While testing the thingies and auditing the published WARC files, I got caught with my pants down - I managed to remove movies and images from the textual content, but I let pass... ZIP FILES! Gee, somehow Forum is hosting not only movies, but also zip files!! So... I did some more hacking on the source code to handle this new situation, and I'm reworking the 202407 assets. I will (probably) update the Archive Org torrent this Sunday night. On a side note, firing up 4 different pywb instances just so I can segregate the assets is very cumbersome. Since I had already cut my teeth on the thing with success, I'm planning to write yet another extension to allow such data segregation on a single pywb instance. This will make things much easier later, if we manage to set up a pool of collaborative scrapers. Well, a laborious but productive week, I say. Now I only need some time to sleep properly... Edited August 4 by Lisias forgot to link the issue
The Minmus Derp Posted August 4 One advantage this forum's weirdly archaic setup has over others is that there are no issues with porting over assets or pictures of any kind - it's just text, because all the images are embedded from elsewhere anyway! If this were the Celestia forums, I suspect that this project would be substantially harder.
Lisias Posted August 4 6 minutes ago, The Minmus Derp said: One advantage this forum's weirdly archaic setup has over others is that there are no issues with porting over assets or pictures of any kind - it's just text, because all the images are embedded from elsewhere anyway! If this were the Celestia forums, I suspect that this project would be substantially harder. As a matter of fact, nope. The Celestia forum serves content from an internal service called /file.php, mentioning the original filename on it. This Forum does something similar using /applications/core/interface/file/cfield.php, and I'm parsing it to redirect the contents to the right proxy (and the proxy downloads them and shoves them into the WARC file). Sites that load content using embedded javascript code, like Tenor, on the other hand, would seriously screw up our lives.
Lisias Posted August 6 (edited) NEWS FROM THE FRONT I finished auditing my content scrape files (with html data), removing anything unrelated (like movies, zip files, spreadsheet documents, you name it!). I refactored the scraping tool to prevent polluting the content WARC files with alien data, cleaned that data out of the WARC files I already have, and I'm going to republish them, replacing the previous (contaminated) content, after checking integrity, compacting them again and signing. I'm in the process of emitting a (decent) report about those pesky HTTP 5xx errors that have been plaguing us for some time. I found them even in the scrapes from 2023.05 I found on Archive Org - details on the post below. The problem has always been there; things only got harsher because there are more people looking at it. I don't think the reason is TTWO downgrading infrastructure anymore. I'm inclined to blame the increasing number of AI players in the World, all of them training their AIs for the next iteration (if not the first). The sad truth is that AI hardware is selling like candy nowadays, and the guys buying that stuff are going to train their systems from scratch... Edited August 8 by Lisias Eternal typos of the Englishless Mind...
Devblaze Posted August 8 1. Github to host, so we can all contribute to the scraping? 2. Have you attempted to capture static snapshots of pages, similar to Webrecorder and HTTrack Website Copier - Free Software Offline Browser (GNU GPL)?
Lisias Posted August 8 (edited) 3 hours ago, Devblaze said: 1. Github to host so we can all contribute to the scraping? I have about 85G of data already. Github allows us to use 2GB for free. Quote GitHub Team provides 2 GB of free storage. For any storage you use over that amount, GitHub will charge $0.008 USD per GB per day, or approximately $0.25 USD per GB for a 31-day month. <...> and $0.50 USD per GB of data transfer. https://docs.github.com/en/billing/managing-billing-for-github-packages/about-billing-for-github-packages So, no way to go Internet Archive on it without footing some money monthly. But there's already something like that, if you prefer. 3 hours ago, Devblaze said: 2. Have you attempted to capture static snapshots of pages similar to Webrecorder and HTTrack Website Copier - Free Software Offline Browser (GNU GPL) I'm doing exactly that, using Web Recorder. However, there are some additional concerns that I'm mitigating - I already talked about them in this post, no need to repeat myself. Keep in mind, however, that one of the main objectives of the endeavour is to minimize the impact on Forum. The scrapy program I coded, as well as some pywb customizations I did, aim exactly at that: hitting Forum as little as possible. Edited August 8 by Lisias Correcting the stats
Lisias Posted August 9 (edited) NEWS FROM THE FRONT I have updated the torrent on Archive Org with audited and sanitized WARC files. https://archive.org/details/KSP-Forum-Preservation-Project There's absolutely no non-user text content in the main WARC files anymore (the ones without a suffix). Binary and non-textual contents are in dedicated WARC files:
images for... images
media for movies, videos and related
styles for CSS, javascript and related
files for anything else (zip files, spreadsheets, and Kraken knows what else I will find)
Such segregation not only helps from the legal point of view (see this post), but also helps a lot with entropy while compressing files - and, boy, the lrzip thingy really knows how to compress something (a file that compresses to 5G with gzip gets down to 900M with lrzip!). Added CHANGE LOG. All WARC files were sanitized to remove:
http 5xx errors
http 4xx errors
revisits where the page didn't change
multiple URL recordings without the revisit tag
duplicated warc records (due to botched file merges)
orphaned warc records (due to their parent being sanitized out)
Updated forum.kerbalspaceprogram.com-202407.warc with the scraped content I collected between July 28th and 31st. From this point on, unless I find something else wrong, the only files that are going to change are the redis.dump and the CHANGE LOG (and the READ ME now and then). Anything else will be additions. If you ever downloaded the thing before, I suggest you do it again. Sorry for the wasted bandwidth - I did my best to prevent it from happening again. All the code and configuration files I'm using are on https://github.com/net-lisias-ksp/KSP-Forum-Preservation-Project . I spent many days doing combinatorial analysis on what would be the best setup for serving the thing. This will be properly documented, but for now, a Quick & Dirty How-To: Use a dedicated btrfs mount point for serving the WARC files, mounted with compress=zstd:9 .
The btrfs zstd:9 compression is substantially better than gzip's; you will save space and time by not using the gzip option on the WARC files. These files will be written only once and read infinite times - don't use a lesser compression level on this mount point. You are going to host very, very big files, and space and IOPS will always be at a premium. It's worth the pain to take a few extra minutes to copy these files there with the maximum compression.
Be logged in on the account you will use to serve the content - let's pretend it will be kspforumproxy.
Clone the repository above somewhere in that mounted file system - let's call it /home/kspforumproxy/data - and cd ~/data (remember to be logged in as kspforumproxy; try sudo -Hu kspforumproxy bash if you don't want to relog in the box).
Install Python 3 (the latest available for your distro) somehow in your rig, and then: python3 -m venv /home/kspforumproxy/python ; . ~/python/bin/activate
Install all the dependencies with pip: pywb scrapy redis (it's scrapy with a single "p" - there's another package with two "p"s, don't install it by accident!). You will need to set up redis-server somewhere in your infrastructure. And openssh and lrzip, and almost surely more things that I forgot right now.
Download the project's torrent somewhere, like /home/kspforumproxy/torrent . Once the download is finished, verify the files:
(deck@steamdeck kspforumproxy)$ cd /home/kspforumproxy/torrent
(deck@steamdeck torrent)$ cd KSP-Forum-Preservation-Project
(deck@steamdeck KSP-Forum-Preservation-Project)$ ./verify.sh
Good "allowed_signers.sig" signature for net.lisias.ksp-Forum-Preservation-Project with RSA key SHA256:u/3NUYl7q7X4kTnQoImJIANkK4D5ClXDeTBcqW1r7ms
Good "CHANGE_LOG.md.sig" signature for net.lisias.ksp-Forum-Preservation-Project with RSA key SHA256:u/3NUYl7q7X4kTnQoImJIANkK4D5ClXDeTBcqW1r7ms
Good "forum.kerbalspaceprogram.com-00000.warc.lrz.sig" signature for net.lisias.ksp-Forum-Preservation-Project with RSA key SHA256:u/3NUYl7q7X4kTnQoImJIANkK4D5ClXDeTBcqW1r7ms
Good "forum.kerbalspaceprogram.com-00001.warc.lrz.sig" signature for net.lisias.ksp-Forum-Preservation-Project with RSA key SHA256:u/3NUYl7q7X4kTnQoImJIANkK4D5ClXDeTBcqW1r7ms
Good "forum.kerbalspaceprogram.com-202407-files.warc.lrz.sig" signature for net.lisias.ksp-Forum-Preservation-Project with RSA key SHA256:u/3NUYl7q7X4kTnQoImJIANkK4D5ClXDeTBcqW1r7ms
Good "forum.kerbalspaceprogram.com-202407-images.warc.lrz.sig" signature for net.lisias.ksp-Forum-Preservation-Project with RSA key SHA256:u/3NUYl7q7X4kTnQoImJIANkK4D5ClXDeTBcqW1r7ms
Good "forum.kerbalspaceprogram.com-202407-media.warc.lrz.sig" signature for net.lisias.ksp-Forum-Preservation-Project with RSA key SHA256:u/3NUYl7q7X4kTnQoImJIANkK4D5ClXDeTBcqW1r7ms
Good "forum.kerbalspaceprogram.com-202407-styles.warc.lrz.sig" signature for net.lisias.ksp-Forum-Preservation-Project with RSA key SHA256:u/3NUYl7q7X4kTnQoImJIANkK4D5ClXDeTBcqW1r7ms
Good "forum.kerbalspaceprogram.com-202407.warc.lrz.sig" signature for net.lisias.ksp-Forum-Preservation-Project with RSA key SHA256:u/3NUYl7q7X4kTnQoImJIANkK4D5ClXDeTBcqW1r7ms
Good "README.md.sig" signature for net.lisias.ksp-Forum-Preservation-Project with RSA key SHA256:u/3NUYl7q7X4kTnQoImJIANkK4D5ClXDeTBcqW1r7ms
Good "redis.dump.lrz.sig" signature for net.lisias.ksp-Forum-Preservation-Project with RSA key SHA256:u/3NUYl7q7X4kTnQoImJIANkK4D5ClXDeTBcqW1r7ms
The allowed_signers file should be bit-for-bit identical to the one you will find on the github repository. I added it here for convenience - and yeah, I know, it would be better to force the user to download it from the github repo. And you know they would not do it anyway. Any other message means that at least one of the files failed the integrity check and should not be used. Call me - something very wrong is happening, I'm signing these files for a reason!
Decompress everything: lrzip -d *.lrz - go get a snack, this is going to take a while...
When finished, mark everything as read-only to prevent accidents: chmod -wx *.warc
Now move the WARC files to their respective destinations, as follows:
forum.kerbalspaceprogram.com-00000.warc -> ~/data/Source/ARCHIVE/forum.kerbalspaceprogram.com/collections/forum-kerbalspaceprogram-com/archive
forum.kerbalspaceprogram.com-00001.warc -> ~/data/Source/ARCHIVE/forum.kerbalspaceprogram.com/collections/forum-kerbalspaceprogram-com/archive
forum.kerbalspaceprogram.com-202407-files.warc -> ~/data/Source/ARCHIVE/forum.kerbalspaceprogram.com/collections/forum-kerbalspaceprogram-com-files/archive
forum.kerbalspaceprogram.com-202407-images.warc -> ~/data/Source/ARCHIVE/forum.kerbalspaceprogram.com/collections/forum-kerbalspaceprogram-com-images/archive
forum.kerbalspaceprogram.com-202407-media.warc -> ~/data/Source/ARCHIVE/forum.kerbalspaceprogram.com/collections/forum-kerbalspaceprogram-com-media/archive
forum.kerbalspaceprogram.com-202407-styles.warc -> ~/data/Source/ARCHIVE/forum.kerbalspaceprogram.com/collections/forum-kerbalspaceprogram-com-styles/archive
forum.kerbalspaceprogram.com-202407.warc -> ~/data/Source/ARCHIVE/forum.kerbalspaceprogram.com/collections/forum-kerbalspaceprogram-com/archive
This may be reworked in the future; things may be made simpler for people willing only to
serve the proxy, and not interested in scraping.
Now you need to rebuild the proxy's indexes:
cd ~/data/Source/ARCHIVE/spider
./reindex-all
And go get another snack, as this is going to take a while again. And now you can finally fire up the proxy:
cd ~/data/Source/ARCHIVE/forum.kerbalspaceprogram.com/collections/forum-kerbalspaceprogram-com
wayback --proxy forum-kerbalspaceprogram-com -p 8080 -b localhost
2024-08-08 20:04:32,344: [INFO]: Proxy enabled for collection "all"
2024-08-08 20:04:32,451: [INFO]: Starting Gevent Server on 8080
And voilà! Now you have your own personal Internet Archive style site to call your own! And, yes, it looks like crap - I didn't finish scraping everything before July 31st. On the other hand, the All Activity page is working fine, so it's still not bad for a first time. You can also navigate the "live" web through this proxy using http://localhost:8080/live/https://forum.kerbalspaceprogram.com/discover/ - convenient for comparing results. You can configure your browser to use http://localhost:8080 as an http proxy, so everything your browser does will go to your local wayback machine first, and if the URL is archived, it will be shown instead of the "live" web. Now, some thoughts about further scrapings:
imgur is not going anywhere in the short term, and I'm not sure it's worth the pain to archive it. But if I do, I will do it in a new dedicated project, to shield this project from copyright issues. Same for personal image servers.
I will not scrape Discord. Reason: 20. Do not mine or scrape any data, content, or information available on or through Discord services (as defined in our Terms of Service). And Discord is going litigious on the matter due to these pesky A.I. scrapers, so I don't want to risk the project by doing it. Yes, you can do it if you want for your own archiving purposes. Yes, I can help you set up the rig for such. No, I will not publish that data on my torrent - but I can help you create your own, if you want.
And I need to study a bit more about how to use memento and other pywb shenanigans to make things work better. === == = CHANGE LOG = == === 2024-08-09 12:00Z fixed the chmod command, as suggested by @jost below. Edited August 9 by Lisias MOAR fixes.
jost Posted August 9 (edited) 10 hours ago, Lisias said: when finished, mark everything are read/only to prevent accidents: chmod -x *.warc Shouldn't this be chmod -wx *.warc? Edited August 9 by jost Typo
Tw1 Posted August 10 I'm just dropping by to salute the amazing work you guys are doing. Is your intent to make this purely an archive? Or to have a backup forum ready should this one go down?
Lisias Posted August 10 6 hours ago, Tw1 said: I'm just dropping by to salute the amazing work you guys are doing 6 hours ago, Tw1 said: Is your intent to make this purely an archive? Or to have a backup forum ready should this one go down? Currently, only an archive. I remember a discussion about hosting a Forum (part on the early pages of this thread, part on another one IIRC), but my take on it is that footing money for hosting a Forum now would be not only wasteful, but probably deleterious for the Scene. We would not manage to get sponsorship, because Companies sponsor things to get visibility, and a backup Forum idling while waiting for the main one to die is not a good way to get such visibility. Yet someone would have to foot the bills - and then, for what? Whoever pays the bills is usually the one calling the shots, and we need to consider this if we ever face the need to replace this Forum: who do you want calling the shots on an (at this time) hypothetical Forum replacement? Anyway, a "backup Forum" would not be able to republish this one's content, as this would create a "derivative" - way beyond the scope of Fair Use, Fair Dealing, or whatever legal or juridical exemption a Country allows for using copyrighted material. It would really have to start from scratch - but it could index the Internet Archive's data and, if for any reason that one falls, a replacement Archive made with the material I'm building.
Lisias Posted August 18 (edited) NEWS FROM THE FRONT Still working on scraping the Forum contents that missed the 202407 release, as well as new content. In the process, I'm now eyeballing all the logs, trying to find ways to minimize hits on Forum. Found some: At least until last week, Forum used to beg for mercy using HTTP 429 Too Many Requests . But scrapy not only ignores it, it just doesn't know about it, logging "Unknown Status" - which ended up preventing me from connecting the dots, missing an excellent opportunity to pause scraping when Forum is being overloaded. I wrote a new middleware to handle it, as well as doing the same for any 5xx received. My reports on the 5xx occurrences will be affected, as I will lose the ability to keep a close eye on them, but I will still log something, so a still-useful report can be issued - only way less detailed. The same log analysis revealed what's now a pretty evident shortcut for scraping: for each page of a thread, there are between 1 and 25 requests that return an HTTP 301 Moved Permanently , related to direct links to specific posts on that thread. Since I'm intending to mimic Internet Archive and provide a navigable Archive, I need to scrape these HTTP 301s - but only once. Once I have scraped one, it will never, ever change again, so I can add it to a blacklist to never be scraped again, preventing further hits on Forum. Same thing for abbreviated URLs. When addressing a thread, Forum really cares only about the number after the / and before the - . For example, the URL https://forum.kerbalspaceprogram.com/index.php?/forum/53-x/ generates an HTTP 301 to https://forum.kerbalspaceprogram.com/forum/53-general/ , and from that point on, I don't ever need to hit Forum again using https://forum.kerbalspaceprogram.com/index.php?/forum/53-x/ .
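The throttling policy described in the first item above can be sketched independently of scrapy (logic only, with a function name of my own invention; the actual implementation is a scrapy downloader middleware). The delays match the ones reported later in the thread: 60 seconds for an HTTP 429, 900 seconds for any 5xx.

```python
def backoff_seconds(status, retry_after=None):
    """How long to pause before retrying a request, or None for no retry.
    HTTP 429 is the site begging for mercy: honor its Retry-After header
    if present, else wait 60 seconds. Any 5xx triggers a 900-second
    (15 minute) embargo on the spot, no questions asked."""
    if status == 429:
        return int(retry_after) if retry_after and str(retry_after).isdigit() else 60
    if 500 <= status < 600:
        return 900
    return None  # 2xx/3xx/4xx: no throttling needed
```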
Rebuilding the ignore-list with the redirects proved to be cumbersome, as they are all merged with the Forum's user contents and I had to parse everything - about 90 Gigabytes up to 202407. So I wrote a small utility to split the redirections from the content on the current dataset, archiving them in separate files. This allows me to quickly build an updated ignore-list before a scraping session. I hope this will help to prevent some hits in the future. Threads that change their titles regularly, such as add'ons' release threads, are currently the focus of my attention - some of these threads are really huge, and re-scraping everything every time the thread changes its title sounds like a waste to me. But not doing it may cause confusion, as the new pages will not properly link into the older pages as scraping continues. I scraped into a dedicated WARC file some legal pages that don't belong to Forum, but affect the terms under which we use it. It will be added to the IA's torrent this weekend (as soon as I sanitize it and double check the contents to avoid republishing it again). Complementing my "sanitizer" tool, I'm finishing coding another one that will be responsible for splitting the scraped data from a monolithic WARCball into the specialized WARC files - not only for legal reasons, but also for practical ones, as the compression hugely improves when we avoid contaminating the highly compressible stream with already somewhat entropic data. Since I ended up writing this tool, I will simplify the current pywb architecture, making it way easier to set up the rig. I'm going to use that tool on the WARC files anyway - no need to segregate them at scraping time anymore, so why bother? I had already reached the limit of what can be done on the (now archaic) current architecture, as it forces me to know beforehand the result of the request - plain impossible on an HTTP Redirect.
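The splitter's core decision - which specialized WARC file each record belongs in - can be sketched as a classifier over a response's status code and Content-Type. This is an illustration only, with bucket names matching the file suffixes used in this thread; the real tool operates on actual WARC records (e.g. via a WARC-parsing library such as warcio), and the exact Content-Type lists are my assumption.

```python
def classify(status, content_type):
    """Pick the destination bucket for a record, mirroring the
    segregation described above: redirects, images, media, styles,
    files, and '*' for everything else - hopefully user content."""
    ct = content_type.split(";")[0].strip().lower()
    if 300 <= status < 400:
        return "redirects"  # e.g. the HTTP 301 records split out above
    if ct.startswith("image/"):
        return "images"
    if ct.startswith(("video/", "audio/")):
        return "media"
    if ct in ("text/css", "application/javascript", "text/javascript"):
        return "styles"
    if ct in ("application/zip", "application/vnd.ms-excel",
              "application/octet-stream"):
        return "files"
    return "*"
```

Keeping the buckets in separate WARC streams is what lets the highly compressible html stay uncontaminated by already-entropic binary data.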
Since I was forced to write this splitter tool anyway, there's no reason not to use it further on the whole segregation process, taking advantage of the Response already being in the WARC file. For the curious, here follows the simplified report of the splitting tool in action:
Processing ../forum.kerbalspaceprogram.com/collections/forum-kerbalspaceprogram-com/archive/forum.kerbalspaceprogram.com-00000.warc
Found 245820 redirect records found. 390886 * records found.
Processing ../forum.kerbalspaceprogram.com/collections/forum-kerbalspaceprogram-com/archive/forum.kerbalspaceprogram.com-00001.warc
Found 81682 redirect records found. 162248 * records found.
Processing ../forum.kerbalspaceprogram.com/collections/forum-kerbalspaceprogram-com/archive/forum.kerbalspaceprogram.com-202407.warc
Found 342338 redirect records found. 236470 * records found. 49 files records found. 5 movies records found.
Processing ../forum.kerbalspaceprogram.com/collections/forum-kerbalspaceprogram-com/archive/forum.kerbalspaceprogram.com-20240811131631900139.warc
Found 23 redirect records found. 70613 * records found.
Processing ../forum.kerbalspaceprogram.com/collections/forum-kerbalspaceprogram-com/archive/forum.kerbalspaceprogram.com-20240812082950967614.warc
Found 97 redirect records found. 19029 * records found.
Processing ../forum.kerbalspaceprogram.com/collections/forum-kerbalspaceprogram-com/archive/forum.kerbalspaceprogram.com-20240812191456172206.warc
Found 50 redirect records found. 6901 * records found.
Processing ../forum.kerbalspaceprogram.com/collections/forum-kerbalspaceprogram-com/archive/forum.kerbalspaceprogram.com-20240813010523547341.warc
Found 16470 redirect records found. 24966 * records found.
Processing ../forum.kerbalspaceprogram.com/collections/forum-kerbalspaceprogram-com/archive/forum.kerbalspaceprogram.com-20240813205303128187.warc
Found 23478 redirect records found. 29387 * records found. 5 images records found.
Processing ../forum.kerbalspaceprogram.com/collections/forum-kerbalspaceprogram-com/archive/forum.kerbalspaceprogram.com-20240815130028563811.warc
  48682 redirect records found.
  51656 * records found.
Processing ../forum.kerbalspaceprogram.com/collections/forum-kerbalspaceprogram-com/archive/forum.kerbalspaceprogram.com-20240817001954848148.warc
  30844 redirect records found.
  26028 * records found.
Processing ../forum.kerbalspaceprogram.com/collections/forum-kerbalspaceprogram-com/archive/forum.kerbalspaceprogram.com-20240817164016690427.warc
  7202 redirect records found.
  7958 * records found.

Note: "* records" are anything that is not a redirect, a movie, an image, a file or a style-related asset - hopefully, user content. Each scraped item necessarily has two records, the Request and the Response; both are archived. The redirect ratio is significant, sometimes near 1...

Well, I'm currently revising (again) the 202407 release and recompressing the data. I will update the torrent on Internet Archive (the repository is being kept up to date as soon as the artifacts are working) and advise here.

=== == = POST EDIT = == ===

Finally I managed to put my sheets together and properly handle the HTTP 429 thingy.

2024-08-18 15:51:30 [forum.kerbalspaceprogram.com] INFO: Site complained about "502". Retrying <GET https://forum.kerbalspaceprogram.com/topic/160140-122-kolyphemus-system-the-kerbalized-polyhemus/?do=findComment&comment=3107347> in 900 seconds.
2024-08-18 16:11:41 [forum.kerbalspaceprogram.com] INFO: Site complained about "502". Retrying <GET https://forum.kerbalspaceprogram.com/profile/178446-findingclock4/content/?type=core_statuses_status&sortby=status_content&sortdirection=asc> in 900 seconds.
<....>
2024-08-18 20:51:21 [forum.kerbalspaceprogram.com] INFO: Site complained about "429". Retrying <GET https://forum.kerbalspaceprogram.com/tags/camera%20focus/?_nodeSelectName=cms_records1_node&_noJs=1> in 60 seconds.
2024-08-18 20:52:31 [forum.kerbalspaceprogram.com] INFO: Site complained about "429". Retrying <GET https://forum.kerbalspaceprogram.com/tags/focus%20bug/?_nodeSelectName=cms_records4_node&_noJs=1> in 60 seconds.
2024-08-18 20:53:32 [forum.kerbalspaceprogram.com] INFO: Site complained about "429". Retrying <GET https://forum.kerbalspaceprogram.com/tags/focus%20bug/?_nodeSelectName=cms_records1_node&_noJs=1> in 60 seconds.

I lost the ability to closely monitor the HTTP 5xx responses, so any report will not be as minutely accurate as the previous ones - but at least I managed to get myself out of the chain of events and ceased to be a possible contributing factor: as soon as the site cries for mercy, I now hear it and leave it alone for as long as it tells me to (usually 60 seconds). Nasty 5xx errors trigger a 900-second (15-minute) embargo on the spot, no questions asked.

Day Job© duties prevented me from finishing the revised 202407 artifacts. I will update this post as soon as I manage to update the IA's torrent.

=== == = POST² EDIT = == ===

Torrent on Internet Archive updated. https://archive.org/details/KSP-Forum-Preservation-Project

CHANGE LOG:

Spoiler

2024-0820 : MOAR Cleaning up
- Added a *-legal.warc file with the Forum's and TTWO's legalese in a single packet. These terms are not hosted on the Forum, so I cooked a dedicated spider just for them. It ended up helping me understand some mistakes in the Forum's spider! I don't expect these to change anytime soon - but better safe than sorry.
- Some dirt was found left in the 202407 scrapings. Cleaned it out.
- The HTTP 301 Moved Permanently records were moved into dedicated WARC files to facilitate internal proceedings.
- Revisits with the same digest profile are now cleaned out of the WARCball. This is saving a lot of space!
- Removed redis.dump, as it became outdated and regenerating it for an old release is cumbersome. Use redo-redis after updating your WARC files.
2024-0807 : Uber Sanitizing & Republish
- Previous content was further audited, sanitized and cleaned up:
  - Removed HTTP 4xx responses (and respective requests)
  - Removed duplicated records in the files, images, media and styles archives
  - Completely removed anything not user-generated from the main WARC file
- Added scraped content up to July 30th

2024-0806 : Republish
- Previous content was audited and cleaned up:
  - Removed non-HTML content that leaked from the main WARCs
  - Removed HTTP 5xx responses (and respective requests)
- Includes the 202305 content, which was added ipsis litteris
- Added scraped content up to July 28th

2024-0729 : Adding signatures
- Added RSA signatures for all artefacts

2024-0728 : First public Release
- Stuff from 202305 added for convenience
- My own scrapings from July 13th to 27th

Forum Announce

Project: https://github.com/net-lisias-ksp/KSP-Forum-Preservation-Project

Edited August 20 by Lisias: POST² EDIT
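The revisit cleanup mentioned in the change log above (dropping revisits that share the same digest profile) is conceptually simple: WARC records carry a payload digest, and identical content re-fetched later can be discarded instead of stored again. A minimal sketch of the idea, with hypothetical names (real WARC processing would go through a library such as warcio rather than plain tuples):

```python
# Hypothetical sketch: drop records whose payload digest was already
# seen, keeping only the first occurrence. This is where the large
# space savings in the WARCball come from.
def dedupe_by_digest(records):
    """Yield (digest, record) pairs, skipping repeated digests."""
    seen = set()
    for digest, record in records:
        if digest in seen:
            continue  # identical payload already archived - skip it
        seen.add(digest)
        yield digest, record
```

Because re-scraped forum pages are frequently byte-identical between sessions, filtering on the digest before compression keeps the highly compressible stream free of redundant copies.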
Lisias Posted August 25 (edited)

NEWS FROM THE FRONT

Scrapings are ongoing, but at a slower pace since I first implemented the 429 delays (see below) last weekend. Some graphs with the (perceived) Forum status over the last 7 days (GMT-3 timezone):

HTTP 429 is harmless; it's the Forum telling me that I'm hitting it too much. I wait at least 60 seconds each time I receive one. They are, fortunately, the absolute majority of occurrences for me now. I also implemented an exponential (to tell you the truth, logarithmic) backoff, so if I start to receive too many 429s in an hour, I increase the delay a bit, up to 60 minutes. So, yeah, scraping can get waaay slower now.

The nasty stuff is the 5xx ones. The incidence dropped drastically, except between Aug 21st, 13:00 (GMT-3) and Aug 24th, 8:00. I don't have a clue about the reason. Each 5xx incurs a 15-minute embargo minimum, also with a logarithmic backoff up to 60 minutes.

Columns without any bars are timestamps without any occurrences other than HTTP 2xx or 3xx. The following chart depicts the scraping times:

Spoiler

https://github.com/net-lisias-ksp/KSP-Forum-Preservation-Project/tree/master/torrent/reports

Edited August 28 by Lisias: Fixing chart links
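The delay policy described above (60 s minimum for 429, 900 s for 5xx, both growing logarithmically with recent error counts and capped at 60 minutes) can be sketched as follows. The exact scaling Lisias uses is not published, so the formula and constants here are illustrative assumptions, not the project's actual code:

```python
import math

# Illustrative sketch of the retry-delay policy: base delays match the
# ones quoted in the post (60 s for HTTP 429, 900 s for 5xx); the
# logarithmic growth factor and the 3600 s cap are assumptions.
def retry_delay(status: int, recent_errors: int) -> int:
    """Seconds to wait before retrying, given the recent error count."""
    base = 900 if status >= 500 else 60
    # Grows slowly with the number of errors seen in the last hour.
    factor = 1 + math.log2(1 + max(0, recent_errors))
    return min(3600, int(base * factor))
```

With zero recent errors this reproduces the quoted minimums (60 s and 900 s), and a burst of errors pushes the delay toward the 60-minute ceiling instead of hammering an already-struggling server.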
Bioman222 Posted August 27

So on the preserving-an-active-forum side of things, maybe we could do a self-hosted alternative. The Fediverse has become very popular in the past few years with the whole Twitter/X happenings, and that is a 1-million-plus-user network run almost exclusively by volunteers and not-for-profits. There's a Reddit equivalent called "Lemmy", and there's an alternate front-end for Lemmy that makes it look like an old-school forum, called "LemmyBB". Maybe that could be a promising avenue if things go south here? Reddit's interface/algorithm just isn't very friendly to mod development, since old threads sink into oblivion after a few days no matter how active they are, and things like the "What are you doing in KSP today" thread just aren't possible on Reddit. There's also "Discourse" (not to be confused with Discord), which is a similar type of software.
Lisias Posted August 27

3 hours ago, Bioman222 said:

So on the preserving-an-active-forum side of things, maybe we could do a self-hosted alternative.

One would not be able to legally host the current content, as it would be IP infringement. The posts may still be owned by the original poster (see here), but the poster has agreed to license them to the Forum's owner. In order to legally host the current content, you would need to get a license from each poster - or convince the current Forum's owner to relicense the content to you (a right also granted by the original poster by posting here, under the ToS).

The best possible outcome is for the Forum to just keep going, rendering all our efforts a waste of time. If the Forum goes down, the least-worst alternative is an IA-style archive (always keeping in mind that I.A. can take some content down due to litigation, so it's not wise to blindly rely on them), mirrored (or plain hosted) by the alternatives trying to cut their teeth on the Scene, under the exact same terms on which the Internet Archive hosts their copy - as a read-only, immutable, but browsable (and, in the future, searchable) archive of the content.
Bioman222 Posted August 27

28 minutes ago, Lisias said:

One would not be able to legally host the current content, as it would be IP infringement. [...]

I think I was unclear in my original reply: I don't mean making a forum with all the old posts - I get that that's legally impossible. I just mean an unofficial fresh forum that we can keep posting in, in the event forum.kerbalspaceprogram.com goes down. That would be a separate project from archiving the old one; I see that there was discussion of having alternatives to keep posting on earlier in this thread. I assume the IA does not have a way to continue an active forum.
Lisias Posted August 27

24 minutes ago, Bioman222 said:

I think I was unclear in my original reply: I don't mean making a forum with all the old posts - I get that that's legally impossible.

Oh, sorry. This being a thread in which I've been talking for weeks only about archiving, I didn't notice it, being somewhat biased towards archives (this uphill battle has all my attention lately).

IMHO yes, we should be considering what to do if the worst happens. But there will be some challenges with this approach; I want to suggest the following posts as a starting point for how to accomplish that:

The concern about name grabbing is pretty relevant. Additionally: (see also my reply below)
purpleivan Posted August 31

On 8/27/2024 at 10:27 PM, Lisias said:

IMHO yes, we should be considering what to do if the worst happens. But there will be some challenges with this approach; I want to suggest the following posts as a starting point for how to accomplish that:

I only came across this thread this evening, but have now read through it to get an idea of the current state of play.

Regarding "what to do if the worst happens": is there a plan in place regarding communication if the forum were to disappear imminently (say, 5 minutes from now)? As I understand it, @Lisias' efforts to create an archive copy of the forum are going well, but if the forum went "pop" suddenly, are those involved in this effort set up to communicate with each other outside this forum?

Additionally, in that situation, how would others not involved in the archive effort, but who might be interested in access to the archive (either immediately, or perhaps weeks/months later), be able to find out how to do that?

I'm concerned that the effort to archive the forum might not reach its potential if the many who might want access to it (if they knew it existed) weren't able to find out about it, due to the forum's demise.

Off the top of my head: possibly a prominent post (with the help of the mods) in a pinned thread (possibly a new thread being added) to say that this is what is being done, and that should the forum disappear suddenly, further news will appear in these other forums/Discords etc.

I've had a pretty long day, so I might not be making sense, but hopefully I am.
Lisias Posted August 31 (edited)

14 hours ago, purpleivan said:

As I understand it @Lisias efforts to create an archive copy of the forum are going well, but if the forum went "pop" suddenly, are those involved in this effort, set up to communicate with each other outside this forum.

Yes.

https://github.com/net-lisias-ksp/KSP-Forum-Preservation-Project
https://archive.org/details/KSP-Forum-Preservation-Project

The IA's torrent is updated at least once a month (as a matter of fact, I'm working on the August deliverables right now).

14 hours ago, purpleivan said:

Additionally, in that situation how would other not involved in the archive effort, but who might be interested in access to the archive (either immediately, or perhaps weeks/months later), be able to find out how to do that?

Yes. But I admit I need to do a better job on the project's documentation.

14 hours ago, purpleivan said:

I'm concerned that the effort to archive the forum might not reach its potential, if the many who might want to get access to it (if they knew it existed) weren't able to find out about it, due to the forum's demise.

There are currently at least 9 full downloads of the IA's torrent, so I think this is currently mitigated. Not to mention the GitHub project itself, where essentially every single bit of information (with full history) is available to anyone willing to take over if for any reason I... humm... "disappear".

14 hours ago, purpleivan said:

Off the top of my head, possibly a prominent post (with help of the mods) in a pinned thread (possibly a new thread being added), to say that this is what is being done and should the forum disappear suddenly, further news will appear in these other forums/discords etc.

I'm unsure if this would be a good idea - I'm working on the expectation of Fair Use (or Fair Dealing, if you are British).
I'm unsure at which point the Forum's moderators would be willing to stick with me on something that is, well, unable to have a proper license in the first place. I have no reservations other than that - if the Forum's Moderation gives me a green light, I will create a thread for it on whatever subforum it would be allowed in.

The reason I'm posting news in this thread (and in this thread only), without advertising it, is that if something bad happens, it's easier to nerf a single thread than to hunt down potentially offending posts scattered around the Forum.

Edited September 1 by Lisias: Entertaining grammars made slightly less entertaining.
purpleivan Posted September 1

OK, thanks @Lisias - understood regarding all that, especially the issues regarding Fair Use.
Lisias Posted September 3 (edited)

NEWS FROM THE FRONT

Deliverables for August 2024 are available now on the IA's torrent.

https://archive.org/details/KSP-Forum-Preservation-Project
https://github.com/net-lisias-ksp/KSP-Forum-Preservation-Project

Spoiler

2024-0902 : Scrapings for August 2024
- Added scraped content for August 2024
- redis dump is back

I will stop scraping for a week to reevaluate some approaches. Not everything went as I would like this month; I could have done a lot more.

Cheers.

--- -- - UPDATE - -- ---

Found and fixed a mistake in the redis dump.

Edited September 3 by Lisias: UPDATE