
KSP Forums Archival Options.



On 7/14/2024 at 8:44 PM, HB Stratos said:

Sounds great! I'd definitely be willing to help out. My own efforts are running into hardware and software issues, so I feel consolidating our efforts makes a whole lot of sense.

I've had the tool working since early Saturday, and the thing works. Setting it up is a bit of a pain in the SAS, but IMHO worth the pain.

I'm gradually building up instructions here: https://github.com/net-lisias-ksp/KSP-Forum-Preservation-Project and, unsurprisingly, I forked the pywb project to publish some small fixes I made (or am still making) here: https://github.com/Lisias/pywb/tree/dev/lisias .

There's an additional benefit to my approach (I went the hard way for a reason - the UNIX way!!!): the scraper script can be customized for distributed processing. There are about 450K pages around here; if we set up a pool of a handful of trusted collaborators, we will be able to keep the mirror updated with relatively little effort on our side (2 people each doing half the job at the same time is way faster) and less load on the Forum (way better than 2 people each doing the whole job by themselves).

Disclaimer


redis is using less than 100 MB of RAM, pywb about 110 MB, and scrapy took 282 MB.

The whole stack is running on my Steam Deck with a USB3 hard drive but, frankly, this is way overkill.

A Raspberry Pi 4 (with a USB2 hard drive) would do the job just the same - using BTRFS with compression for much faster disk I/O.

Hell, I think I could get away with a Raspberry Pi 3!!

Any filesystem with about 256GB will do, but since the thing is somewhat write-intensive, I would plug a 1TB external spinning disk into it (again, using BTRFS with compression turned up to the maximum).

At this moment, I have:

-rw-r--r-- 1 deck deck  41G May  7  2023 forum.kerbalspaceprogram.com-00000.warc
-rw-r--r-- 1 deck deck  20G May  7  2023 forum.kerbalspaceprogram.com-00001.warc
-rw-r--r-- 1 deck deck 221M Jul 13 04:29 forum.kerbalspaceprogram.com-20240713061810129595.warc
-rw-r--r-- 1 deck deck 1.3G Jul 13 09:44 forum.kerbalspaceprogram.com-20240713081235884757.warc
-rw-r--r-- 1 deck deck 6.7G Jul 15 02:48 forum.kerbalspaceprogram.com-20240713124446675422.warc

and about 140,919 pages fetched.

The "00000" and "00001" files are the ones I downloaded from the Web Archive on the link I posted early. The "2024*" ones are the current working in progress.

You see, sharing these files between us will be easy: each WARC file is self-contained and complementary to the others. Since I'm using BTRFS, the files are stored uncompressed - gzipping them should give us about 8 to 1 compression, while using zlib on BTRFS gives me about the same 8 to 1 - but keeping them uncompressed makes accessing the WARC contents simpler and faster.

EDIT: zstd:15 gives us nearly 8.3 to 1!! It's the best option if you don't want to handle WARC.gz files.

Monday night, after Working Hours, I will further update the github repository with the instructions to fire up the scraping infrastructure. Then, hopefully, we can start discussing how to go multiprocessing with the thing.

=== == = POST EDIT = == ===

I think the best way to distribute the files will be by torrent. Torrent files can be updated, by the way, so one single torrent will be enough for everybody. Since serving the files in WARC.gz format will probably be the best option, the host serving the mirror can also help distribute the files via torrent - but keep your quotas in check!
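For reference, creating the initial torrent is simple (a sketch assuming mktorrent is installed; the tracker URL and directory name are placeholders):

# Build a torrent for the compressed WARC set (tracker URL is just a placeholder)
mktorrent -a udp://tracker.example.org:1337/announce \
          -o ksp-forum-warcs.torrent \
          warcs/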


All the archives are great for preserving the content, but they are not exactly in an easily accessible format and there is still the question about the future for the community. Discord and Reddit are not viable options for the content type; a traditional forum is certainly best. There are a multitude of considerations to make with that though.

A forum this size needs some decent hardware to run reliably, which is likely to be the biggest cost factor. Without knowing specifics about what's currently in use - not that it matters much, given every forum software has different demands - I would say that's an easy $200 per month in expenses just to get something big enough to host everything and whatever might come in the future. That's operating at the edge of stability, mind you; it could easily be more.

A new domain needs to be set up, and that incurs costs per year as well. Naturally you'd want to associate it as closely as possible, but therein sits an issue with copyright and trademarks. Setting everything up only for T2 to complain to the registrar about the use of their registered name would be rather annoying and potentially costly.

Forums require software to run them, and this forum currently uses licensed software that can cost quite a bit depending on deployed scale and features. I doubt they'd be nice and provide a license for free, so in order to reduce costs for a revival project, a free forum software would need to be used. There are a few, but they may not have the features people are accustomed to, and whether they work at scale is questionable.

I can speak from years of experience on how difficult it is to set something up at these scales, and it would be really sad to see a continuation effort stall because the load had been underestimated. It may seem like starting small and upgrading things as it becomes necessary is the way to go, but for an active community - one that has recently seen an uptick as well - that will just lead to constant overload. It sounds counter-intuitive, but starting perhaps a little too big and downgrading later to reduce costs is likely to be the better option. Given the situation, some providers may be willing to extend some credits for testing a setup for stability.

I could ask some industry contacts whether their companies would offer credits for setting up machinery to run a continuation of the forum. Domains are easily registered once a name is actually found. The rest is basic sysadmin stuff that I'm sure enough folks on here are familiar with and could thus provide the necessary hours of upkeep. That leaves only the long-term funding as something that would need to be looked into.


12 hours ago, TampaPowers said:

All the archives are great for preserving the content, but they are not exactly in an easily accessible format and there is still the question about the future for the community.

Working on it. Once I manage to publish the pywb archive, the next step will be a search engine. Interestingly enough, this last step will be the easiest - I already have an FTP search engine project for retro-computing working (on a bunch of Raspberry Pis!!), and if we dig enough, I'm absolutely sure we will find even better solutions nowadays (mine was a novelty 5 years ago).

 

12 hours ago, TampaPowers said:

Discord and Reddit are not viable options for the content type; a traditional forum is certainly best. There are a multitude of considerations to make with that though.

Discord was an experiment that went bad.

Reddit is less bad, but the site's format is not the best for what we need. I agree, this Forum is the best format.

 

12 hours ago, TampaPowers said:

A forum this size needs some decent hardware to run reliably, which is likely to be the biggest cost factor. Without knowing specifics about what's currently in use - not that it matters much, given every forum software has different demands - I would say that's an easy $200 per month in expenses just to get something big enough to host everything and whatever might come in the future. That's operating at the edge of stability, mind you; it could easily be more.

Orbiter-Forum is running on about 240 USD/month, if we take the last round of donations as a source for this information.

I think this Forum, right now, would need something more, due to the larger workload.

What we would really need, assuming this Forum will be decommissioned, would be a federated model with many volunteer servers running under some kind of distributed operating system. Boy, I miss the times when Plan9 could have been something...

 

12 hours ago, TampaPowers said:

A new domain needs to be set up, and that incurs costs per year as well. Naturally you'd want to associate it as closely as possible, but therein sits an issue with copyright and trademarks. Setting everything up only for T2 to complain to the registrar about the use of their registered name would be rather annoying and potentially costly.

This is where I think things would not be so bad. If T2 decides to complain, it would be because they want to do something with the IP - which means the Forum would still be alive.

We need to keep in mind that we are not working to replace the Forum; we are working to guarantee content preservation and to have a lifeboat available if the ship sinks.

It's still perfectly possible that we are just overreacting, that nothing (even more) bad is going to happen, and that the Forum will be available for a long time.

 

12 hours ago, TampaPowers said:

Forums require software to run them, and this forum currently uses licensed software that can cost quite a bit depending on deployed scale and features. I doubt they'd be nice and provide a license for free, so in order to reduce costs for a revival project, a free forum software would need to be used. There are a few, but they may not have the features people are accustomed to, and whether they work at scale is questionable.

IMHO, if we are going the extra mile and setting up a Forum to be used in the unfortunate (and, at this time, hypothetical) absence of this one, we should go Open Source as much as we can to keep the costs down.

I agree that closed/licensed solutions are way more polished, but a non-profit community that will rely on donations (at best) and/or sponsorship (more probably) needs to keep the costs down.

Voluntary work is cheaper than licensed software.

I think we need to look around and see what the current alternatives are - but of one thing we can be sure: it will not be exactly like this Forum.

 

12 hours ago, TampaPowers said:

I can speak from years of experience on how difficult it is to set something up at these scales, and it would be really sad to see a continuation effort stall because the load had been underestimated. It may seem like starting small and upgrading things as it becomes necessary is the way to go, but for an active community - one that has recently seen an uptick as well - that will just lead to constant overload. It sounds counter-intuitive, but starting perhaps a little too big and downgrading later to reduce costs is likely to be the better option. Given the situation, some providers may be willing to extend some credits for testing a setup for stability.

The problem I see is that the distributed model initially envisioned for the Internet was murdered and buried by commercial interests.

The ideal solution would be distributed computing, with many, many, really many small servers volunteered by many, many, really many individual contributors.

We are having this problem on the Forum exactly due to the monolithic nature of the solution (which matches the commercial interests of the owner).

This "business model" is unsuited for a non-profit community effort.

Granted, I'm unaware of any other widely adopted alternative. I doubt we could go WERC on this one.

 

12 hours ago, TampaPowers said:

I could ask some industry contacts whether their companies would offer credits for setting up machinery to run a continuation of the forum. Domains are easily registered once a name is actually found. The rest is basic sysadmin stuff that I'm sure enough folks on here are familiar with and could thus provide the necessary hours of upkeep. That leaves only the long-term funding as something that would need to be looked into.

Sponsorship, IMHO, is going to be our best chance. But how do we gather sponsors for a project whose existence depends on the failure of this Forum?

"Here, we are asking for some donations to keep this new Forum - but it will not be used, unless the main one goes down..."

Companies sponsor things for a reason: they want some visibility in exchange - "look at us, we are sponsoring this!". They will not get that return unless the thing goes live for good, aiming to replace this Forum - something that, to the best of my knowledge, is not the aim of all this effort.


NEWS FROM THE FRONT

Some unexpected events at the Day Job© prevented me from writing proper Scraping KSP for Dummies instructions, so for now I'm just uploading the configuration files so someone could start a new dataset if desired. This is the hard part anyway, unless you have already done it before (obviously), because there are so many ways to set the thing up wrongly... :P

The scraper is still monolithic; I gave some thought to the distributed effort but haven't coded anything yet. I haven't spent a second on scraping MY personal messages either.

Suggestions are welcome.

https://github.com/net-lisias-ksp/KSP-Forum-Preservation-Project

Dig your way through the files; there's good information in the scripts and code too.

DO NOT try to accelerate the scraping; currently the thing is doing 1 page/sec, with exponential backoff of up to 5 minutes. This is the best compromise I reached to allow the thing to run 24x7 - from Friday night to Saturday midday you could get away with being a bit more aggressive, but I concluded the gains weren't worth reworking the scraper to behave differently in that time window.

Trying to make things faster for you will make Forum slower for everybody, including you.

Please mind the rest of the users.

I made a small modification to pywb to allow it to create uncompressed WARC files while scraping, since I'm using BTRFS and, surprisingly, the compression ratio is even better that way. Fixes and documentation updates will be applied here before a pull request is made upstream - any help would be welcome too.

https://github.com/Lisias/pywb/tree/dev/lisias

Currently, I have archived 284,447 responses (pages et al.) from the Forum. For the sake of curiosity, IA has 2,164,265 - but that includes imgur images and the older URI scheme too, so Forum pages are duplicated there. There are instructions in my repo above for downloading IA's CDX and extracting this information from it.

-rw-r--r-- 1 deck deck  41G May  7  2023 forum.kerbalspaceprogram.com-00000.warc
-rw-r--r-- 1 deck deck  20G May  7  2023 forum.kerbalspaceprogram.com-00001.warc
-rw-r--r-- 1 deck deck 1.5G Jul 13 09:44 forum.kerbalspaceprogram.com-20240713061810129595.warc
-rw-r--r-- 1 deck deck 9.4G Jul 15 13:31 forum.kerbalspaceprogram.com-20240713124446675422.warc
-rw-r--r-- 1 deck deck 4.0G Jul 17 02:27 forum.kerbalspaceprogram.com-20240715163142475518.warc
-rw-r--r-- 1 deck deck 802M Jul 16 12:36 ia.forum-kerbalspaceprogram-com.cdx
-rw-r--r-- 1 deck deck 277M Jul 16 12:50 ia.uri.txt
-rwxr-xr-x 1 deck deck  183 Jul 16 12:10 uri.sh
-rw-r--r-- 1 deck deck  32M Jul 17 03:57 uri.txt

I'm using BTRFS with zstd:15 compression for this job (such high compression is not advisable for normal use!), and the results are, at worst, similar to the gzip solution, with the benefit that you will not need to recompress the WARC file after scraping - and grepping the WARC files is way more convenient:

(130)(deck@steamdeck archive)$ sudo compsize *
Processed 9 files, 615225 regular extents (615225 refs), 1 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       12%      9.6G          75G          75G
none       100%      531M         531M         531M
zstd        12%      9.0G          74G          74G

You can change the compression (and level) for a specific file on BTRFS as follows:

(deck@steamdeck archive)$ btrfs property set forum.kerbalspaceprogram.com-20240715163142475518.warc compression zstd:15
(deck@steamdeck archive)$ btrfs filesystem defrag -czstd forum.kerbalspaceprogram.com-20240715163142475518.warc

But, usually, the best way is to remount the volume with the desired compression before doing the job, and then unmount and mount it again afterwards to reset the settings.

sudo mount -o remount,compress=zstd:15 /run/media/deck/MIRROR0/

And, yes, it's compress on the /etc/fstab and compression on the btrfs properties. :confused: Go figure it out...

The memory consumption for pywb and redis is pretty low, 97M and 170M at this moment. But the scrapy process is eating 2.2G on the host machine - I don't know if this is because the machine has memory to spare and is currently dedicated to the job, but right now this is not a concern.

And you can still play some games on the Deck while it does the scraping!!! :sticktongue:


NEWS FROM THE FRONT

Scraping with pywb is, well, slow. I think the way they do deduplication performs terribly - as time passes and the redis database grows (and, boy, does it grow!), the pages/minute ratio drops and drops again.

This can't be fixed right now, so I'm just taking the hit.

So I decided to prioritize archiving the content without any styling or images; this should greatly reduce the redis hits, preventing things from slowing down even more.

Of course, the constant Bad Gateways are not helping either - which prompts me to quote myself:

On 7/17/2024 at 2:44 AM, Lisias said:

Trying to make things faster for you will make Forum slower for everybody, including you.

Please mind the rest of the users.

Currently, I have archived 425,827 pages from the Forum; the old WARCs from 2023 have 440,387 - so I'm near the end (hopefully).

The Internet Archive's CDX tells me they have 2,164,265 URIs - but that includes imgur images and the older URI scheme too, so most Forum pages are duplicated there.

I'm guessing I will conclude the html pages by tomorrow night and then I will focus on the images and styles.

The current filesystem usage is:

-rw-r--r-- 1 deck deck 43543215443 May  7  2023 forum.kerbalspaceprogram.com-00000.warc
-rw-r--r-- 1 deck deck 20403921555 May  7  2023 forum.kerbalspaceprogram.com-00001.warc
-rw-r--r-- 1 deck deck  1609457399 Jul 13 09:44 forum.kerbalspaceprogram.com-20240713061810129595.warc
-rw-r--r-- 1 deck deck 10000118024 Jul 15 13:31 forum.kerbalspaceprogram.com-20240713124446675422.warc
-rw-r--r-- 1 deck deck 10000038580 Jul 19 01:31 forum.kerbalspaceprogram.com-20240715163142475518.warc
-rw-r--r-- 1 deck deck   625296777 Jul 19 05:27 forum.kerbalspaceprogram.com-20240719043104692502.warc
-rw-r--r-- 1 deck deck    11524830 Jul 19 05:23 imgs.txt
-rwxr-xr-x 1 deck deck         183 Jul 16 12:10 uri.sh
-rw-r--r-- 1 deck deck    89496075 Jul 19 05:24 uri.txt

Given the timestamps on the filenames, I'm managing to scrape about 5GB of text a day on weekends, and 2.5GB on working days. Again, my enemies are the Bad Gateways, due to the exponential backoffs (increasing delays on every error), and the increasingly slow deduplication.

I'm not planning to archive imgur (or similar services') images, at least not at this point. If the worst happens, they will still be there and we can scrape them later, once the most pressing issues are tackled. Currently, there are approximately 207,389 unique img srcs in my archived pages, 130,641 of them from imgur.
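(If you're curious, counts like these can be pulled straight from the WARCs with plain text tools - a rough sketch; the pattern is approximate and will miss some edge cases:)

# Rough count of unique image sources referenced by the archived pages
grep -aoE '<img[^>]+src="[^"]+"' forum.kerbalspaceprogram.com-*.warc \
  | grep -oE 'https?://[^"]+' | sort -u | wc -l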

=== == = POST EDIT = == ===

And, nope, I don't have (yet) the slightest idea about how to archive personal messages. :) (as initially discussed here)


2 hours ago, Lisias said:

And, nope, I don't have (yet) the slightest idea about how to archive personal messages. :)

I would recommend not even trying, aside from your own personal correspondence, which you'd be free to do anyway via copy-paste.
I'd say that without a specific reason to save an individual message, there probably isn't a reason to back up your inbox en masse.

If you do happen to stumble upon a way to archive PM’s in general, uhhh, let us know.    Those should be behind an authentication wall, and you should only have access to your own inbox.  
That goes for any private information.   If you do gain access to anything that should require authentication, please contact us privately.   
I’d say a good rule of thumb for this project is that only data that can be viewed by guests who are not logged in should be archived.  

I don’t think that is what you are doing, @Lisias, your comment just made me think we should state something that hopefully didn’t need said, but probably does. 


2 minutes ago, Gargamel said:

If you do happen to stumble upon a way to archive PM’s in general, uhhh, let us know.    Those should be behind an authentication wall, and you should only have access to your own inbox.  


That goes for any private information.   If you do gain access to anything that should require authentication, please contact us privately.   
I’d say a good rule of thumb for this project is that only data that can be viewed by guests who are not logged in should be archived. 

That's the idea - providing a tool for people to back up their own private messages.

The problem is not technological - you can log in and handle the cookies with curl if you want, and then scrape whatever you want as long as your credentials have access.
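Just to illustrate that the mechanics are trivial - a rough curl sketch (the login endpoint and form field names are guesses, not the Forum's actual ones; check the real login form before relying on this):

# Log in once and keep the session cookie in a jar (field names are placeholders)
curl -c cookies.txt -d 'auth=YOUR_USERNAME' -d 'password=YOUR_PASSWORD' \
     https://forum.kerbalspaceprogram.com/login/

# Reuse the cookie jar for anything your credentials can see, e.g. your own inbox
curl -b cookies.txt -o my-messages.html \
     https://forum.kerbalspaceprogram.com/messenger/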

The hard part is to provide a tool that:

  • It's easy to use for the average Joe
  • It's safe (i.e., no credentials or messages leaking to anyone else)
  • It's trustworthy (it's not enough to be honest; people need to believe - and verify - that you are honest)

This idea was originally considered in this post.

 


1 hour ago, Lisias said:

The hard part is to provide a tool that:

  • It's easy to use for the average Joe
  • It's safe (i.e., no credentials or messages leaking to anyone else)
  • It's trustworthy (it's not enough to be honest; people need to believe - and verify - that you are honest)

I have had good success with patreondownloader. It uses a Chromium browser that is controlled entirely via debugging. That way the user gets a login interface at forum.kerbalspaceprogram.com and can check that the signature is valid. Then one can either dump the cookies or use the browser itself to gather all the pages that should be downloaded. As for being trustworthy, being open source _and_ providing compile/build instructions are the way to go, I would say. There are some privacy concerns here: depending on what jurisdiction you're under, it may only be legal to make copies of the DMs if both parties have agreed to it, though this may be me misinterpreting the law - feel free to correct me.


30 minutes ago, HB Stratos said:

There are some privacy concerns here: depending on what jurisdiction you're under, it may only be legal to make copies of the DMs if both parties have agreed to it, though this may be me misinterpreting the law - feel free to correct me.

These privacy concerns are no different from using Google to back up your WhatsApp chats. Making backups is OK; giving them to anyone else is not.

But since there's nothing preventing the user from saving the pages directly from their browser (or letting someone read them over their shoulder), I don't see how this could be a legal problem either. The expectation of privacy is exactly the same in all cases; it's up to the user themselves to respect the legislation, whether using such a tool or doing things by hand.

 

36 minutes ago, HB Stratos said:

I have had good success with patreondownloader. It uses a Chromium browser that is controlled entirely via debugging. That way the user gets a login interface at forum.kerbalspaceprogram.com and can check that the signature is valid. Then one can either dump the cookies or use the browser itself to gather all the pages that should be downloaded.

You don't need to use a headless browser to fetch the cookies, you can do everything with wget or curl.

The Forum doesn't obfuscate content using JavaScript, so you don't need a JS runtime to decode things. Unless you find some browser plugin that would do the job for us - that would change everything, as long as we can trust the author! :)
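For example, a cookie-authenticated fetch with plain wget could look roughly like this (a sketch only - the cookie jar comes from whatever login step you used, the URL is a placeholder, and the delays should stay polite):

# Mirror one section of the Forum reusing an existing cookie jar; throttled on purpose
wget --load-cookies cookies.txt \
     --recursive --level=2 --no-parent --page-requisites \
     --wait=1 --random-wait \
     'https://forum.kerbalspaceprogram.com/SOME-SECTION/'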

I think Firefox could be of use here - there's no closed-source version of Firefox, and the Mozilla Foundation is pretty careful about privacy - so if we find a Firefox plugin that could do this job for us, it would take a lot of weight off our shoulders.

You know what? Nice idea. I will see if I can find something along those lines.

 

37 minutes ago, HB Stratos said:

As for being trustworthy, being open source _and_ providing compile/build instructions are the way to go I would say.

Which is the reason I would prefer to have everything written in Python or something else that doesn't need to be compiled. :) The easier the language, the more people can learn enough of it to tell whether there's something fishy in the code or not.

Don't forget the xz utils supply-chain attack - the injection vector was open and public on GitHub for anyone to see.


1 hour ago, Lisias said:

Don't forget the xz utils supply-chain attack - the injection vector was open and public on GitHub for anyone to see.

and caught, mind.


6 minutes ago, R-T-B said:

and caught, mind.

By chance. Had that MS engineer not been puzzled by those weird CPU spikes, the stunt could have worked.

And, mind you, nobody detected the offending code until that dude investigated an unexpected misbehaviour.

Source code is of little use if nobody is reading (and understanding) it - hence my decision to pursue the easiest programming language that is supported well enough to be useful, the subject of my original post! :)


NEWS FROM THE FRONT

I have currently fetched 513,889 pages, so now we are in uncharted territory (the 2023 dump has 440,387). I don't have a clue how much more time I still need, but logic suggests we are near the end of this first phase.

-r--r--r-- 1 deck deck  41G May  7  2023 forum.kerbalspaceprogram.com-00000.warc
-r--r--r-- 1 deck deck  20G May  7  2023 forum.kerbalspaceprogram.com-00001.warc
-r--r--r-- 1 deck deck 1.5G Jul 13 09:44 forum.kerbalspaceprogram.com-20240713061810129595.warc
-r--r--r-- 1 deck deck 9.4G Jul 15 13:31 forum.kerbalspaceprogram.com-20240713124446675422.warc
-r--r--r-- 1 deck deck 9.4G Jul 19 01:31 forum.kerbalspaceprogram.com-20240715163142475518.warc

I also settled on the compression tool to feed the torrent, lrzip. This thing gave me a 44 to 1 compression ratio in the best cases - it's amazing, but also extremely slow. Really, really slow - hours and hours to compress these big beasts.

-r--r--r-- 1 deck deck 941M May  7  2023 forum.kerbalspaceprogram.com-00000.warc.lrz
-r--r--r-- 1 deck deck 413M May  7  2023 forum.kerbalspaceprogram.com-00001.warc.lrz
-r--r--r-- 1 deck deck  56M Jul 13 09:44 forum.kerbalspaceprogram.com-20240713061810129595.warc.lrz
-r--r--r-- 1 deck deck 687M Jul 15 13:31 forum.kerbalspaceprogram.com-20240713124446675422.warc.lrz
-r--r--r-- 1 deck deck 536M Jul 19 01:31 forum.kerbalspaceprogram.com-20240715163142475518.warc.lrz

But, damn, it's a 44 to 1 compression ratio!!!

--- -- - POST EDIT - -- ---

I made a mistake! The 2023 WARC files were fetched using a custom crawler. I'm going the I.A. way, so I was comparing apples with oranges.

The CDX I fetched from IA has 2,164,265 pages. That is the benchmark I should be using.

So I'm apparently at 25% of the job.


35 minutes ago, Lisias said:

I also settled on the compression tool to feed the torrent, lrzip. This thing gave me a 44 to 1 compression ratio in the best cases - it's amazing, but also extremely slow. [...]

Concerning compression tools: zstd is considered a good compromise between speed and compression ratio, so maybe worth a try if time ends up being an issue.


17 hours ago, jost said:

Concerning compression tools: zstd is considered a good compromise between speed and compression ratio, so maybe worth a try if time ends up being an issue.

I considered it. But it didn't get near 44 to 1. These files will be torrented and never recompressed again, so it makes sense to use the best compression possible. It will be done only once!

So I made a quick test:

-r--r--r-- 1 deck deck 43543215443 May  7  2023 forum.kerbalspaceprogram.com-00000.warc
-r--r--r-- 1 deck deck  1427252342 May  7  2023 forum.kerbalspaceprogram.com-00000.warc.zst
-r--r--r-- 1 deck deck   986003377 May  7  2023 forum.kerbalspaceprogram.com-00000.warc.lrz

The commands I used were:

zstd -9 --keep forum.kerbalspaceprogram.com-00000.warc
lrzip -z --best --keep forum.kerbalspaceprogram.com-00000.warc

The difference between the compression tools is 441,248,965 bytes - about 0.44 GB - for a file that will, ideally, be downloaded hundreds of times.

For someone hosting it on AWS, which charges about $0.09 per GB of egress, the difference between zstd and lrz is about $0.04 per download - for a file that is expected to be available for years.

The costs pile up pretty quickly.
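The arithmetic, for whoever wants to check it (the egress price is the assumption here):

# 441,248,965 bytes at an assumed $0.09 per GB of egress
echo 'scale=4; 441248965 / 10^9 * 0.09' | bc
# prints roughly .0397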

--- -- - POST EDIT - -- ---

I think I detected why it was so slow. I found two bottlenecks:

  1. I hadn't reindexed the collection I was feeding, so when the proxy fetched a page again, it was fetched from the Forum instead of from the archive.
    1. reindexing the collections regularly should alleviate this problem (see the sketch right after this list)
    2. note to myself:
      1. redis is for deduplication, where pywb decides whether it will store the page or not - it does not prevent a live fetch
        1. duh...
      2. cdx(j) is for serving pages from the archive (or not).
      3. We need both, up to date, to do the job
  2. The scrapy tool was fetching the same pages all the time, as a lot of links target already-crawled pages, wasting time hitting the proxy (and the Forum)
    1. this one I detected after fixing the previous one.
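A sketch of the reindexing step mentioned in item 1.1 (wb-manager is pywb's collection-management command; the collection names are just examples):

# Rebuild the CDXJ indexes so the proxy serves already-archived pages from disk
# instead of hitting the live Forum again
wb-manager reindex forum-html
wb-manager reindex forum-images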

Now that I'm thinking about it, these are pretty obvious mistakes - but they only occurred to me today...

I'm fixing these problems right now; I will test the fixes and then update the GitHub project.

https://github.com/net-lisias-ksp/KSP-Forum-Preservation-Project


I know I am pushing a bit far beyond the archival part here. Apologies. My concern is less about the forum actually shutting down and more about the vacuum that would leave, and the potential fragmentation that could result in this great community effectively going extinct. That sounds like doom and gloom, but I have been around long enough to have seen this happen to many beloved communities, and so I am keen to help prevent it if I can. Not saying it will happen, just that it could.

Does moderation have any insight into the current setup of these forums? Server(s)? Load? That would be very useful in working out which forum software would be required, as only some of them can be set up with caching and other load-distribution systems.

Even without knowing the specifics about the load the forums are under, I would think there are open source solutions that would handle it just fine with the right hardware and some tweaks to the setup. It might not be the most solid solution, but it can certainly work well enough to sustain the community. Oftentimes it comes down to investing time into tweaking such things, all the way to updating code to work a little better. We all remember the days of phpBB fondly, I would hope. These things have not been sleeping either, but they certainly need a bit of help. I'm sure we have enough sysadmin and developer experience going around that sustaining a community-run forum would not be an issue if properly organized.

I was gonna write that I would set up a whole backup plan for the forum - domain, server and software included - but I think that is a bit premature right now, and without the key people in the community and the moderation of the forum on board with establishing such a thing it would only do more harm than good. Of course the easier option would be just for the forums to stay online as they are, or for T2 to actively seek a conversation on how to resolve things. One can dream, right? :)  My concerns being unfounded and never materializing is the best outcome, so let's hope I'm wrong.


NEWS FROM THE FRONT

I just updated the scraping tool.

Now it's scraping images and styles too, exactly as I intended: HTML pages in one collection, images in a second one, and anything related to CSS and styling in a third one. The anti-dupe code is also working fine, preventing the scrapy tool from visiting the same page twice in a session - exactly what had screwed me on July 13th, when I first tried it. I'm doing proper logging now, too.

pywb allows us to dynamically merge the collections and serve them on a single front-end, as if they were just one. Pretty convenient.

The rationale for this decision is simple, though not exactly obvious: images almost never change, and neither do styles. Scraping them separately will save a bit of the Forum's resources and scraping time while updating the collections, as the images will rarely (if ever) change. Same for styles. So we can just ignore them while refreshing the archive contents.

There's an additional benefit to keeping textual info separated from images and styles: whoever owns the IP owns the images and styles, but not the textual contents. Posts on the Forum are almost unrestrictedly and perpetually licensed to the Forum's owner, but they still belong to the original poster. So whoever owns the IP, at least theoretically, has no legal grounds to take down that content - assuming the worst scenario, where this Forum goes titties up and a new owner decides to take down the Forum mirrors, they will be able to do so only for the material they own: images and styling. And those we can easily replace later, forging a new WARC file standing in for the now-lost content.

Ok, ok, in Real Life™ things don't work exactly like that. But it costs very little (if anything) to take some preventive measures, no?

Scrapy tells me INFO: Crawled 316615 pages (at 58 pages/min), scraped 31847497 items (at 6128 items/min) at this moment, but the last time the WARC file was touched was Jul 20 20:19. So, apparently, the sum of the older WARC files from 2023 and the new ones I'm building now holds all the information since the last time I restarted the tool. Please note that the tool counts as a scraped item anything it crawls into, regardless of whether it was a dupe, fetched, or ignored.

Right now, I'm trying to find my way around the Internet Archive to host the torrent file I will build with the material I already have. I hope it can be a mutable torrent, otherwise I will need to find some other way to host these damned huge files - I intend to have it updated at least once a month, which is the reason it needs to be a mutable torrent.

People not willing to host anything can also find this stunt useful, as there are tools to extract the WARC contents and build a dump on their hard disks, like the one you would get from a 'dumb' crawler such as HTTrack.

So, really, there will be no need for everybody and the kitchen sink to hit the Forum all the time to scrape it.

Finally, once I have the torrent hosted somewhere, I will start cooking up a way to scrape the site cooperatively, so many people can share the burden, making things way faster and saving the Forum's resources - once people realize they don't need to scrape things themselves every time, I expect the load on the Forum to ease up considerably.

This is what I currently have:

-r--r--r-- 1 deck deck  41G May  7  2023 forum.kerbalspaceprogram.com-00000.warc
-r--r--r-- 1 deck deck  20G May  7  2023 forum.kerbalspaceprogram.com-00001.warc
-r--r--r-- 1 deck deck  24G Jul 25 15:32 forum.kerbalspaceprogram.com-202407.warc

I expect future WARC files to be smaller and smaller.

Last, but not least:

On 7/17/2024 at 2:44 AM, Lisias said:

Trying to make things faster for you will make Forum slower for everybody, including you.

Please mind the rest of the users.

 

=== == = POST EDIT = == ===

For the sake of curiosity, here are some stats pulled from the archives I have at this moment:

forum.kerbalspaceprogram.com-00000.warc
 318421 Content-Type: application/http; msgtype=request
 318421 Content-Type: application/http; msgtype=response
      6 Content-Type:             application/json
      2 Content-Type:     application/json
      8 Content-Type: application/json
      2 Content-Type:                 application/json;charset=utf-8
      2 Content-Type: application/x-www-form-urlencoded
 122909 Content-Type: ;charset=UTF-8
     24 Content-Type: text/html; charset=UTF-8
 195482 Content-Type: text/html;charset=UTF-8

forum.kerbalspaceprogram.com-00001.warc
 121990 Content-Type: application/http; msgtype=request
 121990 Content-Type: application/http; msgtype=response
  40841 Content-Type: ;charset=UTF-8
  81149 Content-Type: text/html;charset=UTF-8
  
forum.kerbalspaceprogram.com-202407.warc
 553096 Content-Type: application/http; msgtype=request
 553096 Content-Type: application/http; msgtype=response
     27 Content-Type: application/json;charset=UTF-8
      1 Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet;charset=UTF-8
      5 Content-Type: application/x-unknown;charset=UTF-8
      8 Content-Type: application/zip;charset=UTF-8
 342190 Content-Type: ;charset=UTF-8
   1497 Content-Type: text/html
      7 Content-Type: text/html; charset=UTF-8
 208622 Content-Type: text/html;charset=UTF-8
      3 Content-Type: text/plain; charset=UTF-8
      2 Content-Type: text/plain;charset=UTF-8
     93 Content-Type: text/xml;charset=UTF-8
     28 Content-Type: video/mp4;charset=UTF-8
      3 Content-Type: video/quicktime;charset=UTF-8

As we can see, some videos leaked into the WARC files... I'm working on it.
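(For the record, tallies like the ones above can be pulled straight from a WARC with plain text tools - something along these lines:)

# Count the Content-Type headers inside a WARC (-a treats the file as text, -h drops the filename prefix)
grep -ah '^Content-Type:' forum.kerbalspaceprogram.com-202407.warc | sort | uniq -c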

=== == = POST2 EDIT = == ===

This one gave me a run for my money.

The Forum serves media (like movies) using an interface called /applications/core/interface/file/cfield.php, where the file is sent as a query parameter. But the crawler was looking only at the path, taking advantage of the fact that the Forum doesn't obfuscate the artefacts - which made my life simpler while routing html, image and style files to their respective collections. Until now.

Since the crawler didn't parse the query, it thought it was a content file and stored the freaking videos together with the html files, screwing me because:

  1. These videos are the Forum's IP, so shoving them in with the textual content would erode my chances in a hypothetical future copyright takedown attempt.
  2. It royally screwed the compression ratio of the WARC file!!

As a matter of fact, I thought it was weird that the 2023 WARC files were compressing at 44 to 1 while mine managed "only" 22 to 1 - they should compress at similar ratios, as they hold similar content. That is the reason I made those stats above, already suspecting some already-compressed content had leaked into the stream - but I was thinking of some images or maybe a zip file, not a whole freaking movie file! :D

Anyway, I salvaged the files, moving the image/* content into the image collection and removing about 2G of binary data from the text stream.

I'm double-checking everything and recompressing the data files; I will pursue torrenting them tomorrow.

Cheers!

=== == = POST3 EDIT = == ===

Well, I blew it. I made a really, REALLY, REALLY stupid mistake in the spider that ended up as a memory leak, which just kept growing and growing without me being aware of it.

When I finally noticed the problem, I tried to salvage the situation (scrapy has a telnet interface in which you can do whatever you want - including hot code changes!!) but didn't have enough memory available.

So I tried terminating one of the proxies (the style one), as it had been idle since the start, to have enough memory to work on the system and try adding a swap file on the Steam Deck. This would lift the pressure and allow me to try to salvage the session, not to mention stop grinding my M.2, which probably lost half of its lifespan on this stunt... :P

Problem... this dumbass typing this post decided to create the swapfile in the dumbest way possible:

dd if=/dev/zero of=/run/media/deck/MIRROR0/SWAP1 bs=1G count=8

I was lazy (and still sleepy) and decided I didn't want to grab a calculator to see how many 4K blocks I would need to reach 8G, so I told dd to write 8 blocks of 1G each and call it a day.

But by doing that, dd tried to malloc a 1G buffer on a system that was fighting for 32KB of the current swapfile. So the kernel decided to kill something, elected SSHD (probably), and the session was finished. However, SteamOS uses that terrible excuse for an init called systemd, and this crap automatically kills all the processes owned by a user when they log off, which is essentially what happens when you lose an SSH session.
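(For the record, a less memory-hungry way to create the same 8G swap file - a sketch, untested on SteamOS, using the same path as above; on BTRFS the file also has to be NOCOW and uncompressed, hence the touch + chattr first:)

# Same 8G, but written in 4M chunks so dd never needs a huge buffer
touch /run/media/deck/MIRROR0/SWAP1
chattr +C /run/media/deck/MIRROR0/SWAP1
dd if=/dev/zero of=/run/media/deck/MIRROR0/SWAP1 bs=4M count=2048 status=progress
chmod 600 /run/media/deck/MIRROR0/SWAP1
mkswap /run/media/deck/MIRROR0/SWAP1
swapon /run/media/deck/MIRROR0/SWAP1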

And then I had to restart the scraping this morning. I'm currently back at:

2024-07-27 19:15:36 [scrapy.extensions.logstats] INFO: Crawled 32474 pages (at 80 pages/min), scraped 839693 items (at 2080 items/min)

There was no loss of data - the redis database is hosted on another computer (I did some things right, after all) - so the current scraping run is just to be sure I fetched all the pages from the Forum without leaving anything behind.

Oh, well... Sheet happens! :)

I'm compressing what I have now (I will later add a new WARC file with whatever is being scraped at the moment) and will proceed with the creation of the torrent.

=== == = POST4 EDIT = == ===

It's been about 90 minutes since the log above, and now I have:

2024-07-27 20:46:36 [scrapy.extensions.logstats] INFO: Crawled 39619 pages (at 73 pages/min), scraped 1025463 items (at 1924 items/min)

I.e.:

  •  992,989 items scraped, or about 11K items scraped per minute.
  • 7,145 pages crawled, or about 79 pages per minute.

I know from fellow scrapers that we have about 400K pages, so unless things speed up, I still have about 84 hours to complete the task. :/

Oukey, this settles the matter. I will publish the torrent tomorrow, and later update it.

[edit: I made the same mistake again - I went the Internet Archive way; it's way more pages!]

 === == = BRUTE FORCE POST MERGING = == ===

On 7/22/2024 at 12:26 AM, TampaPowers said:

I know I am pushing a bit far beyond the archival part here. Apologies.

By all means, no apologies! This is a brainstorm; we were essentially throwing... "things" :P at the wall to see what sticks.

Jumping the gun is only a problem when there's no one around to tell you that you are jumping the gun, so no harm, no foul.

And it's good to know that someone is willing to go that extra mile if needed - that message came through loud and clear. So, thank you for the offer! ;)

 

On 7/22/2024 at 12:26 AM, TampaPowers said:

My concerns being unfounded and never materializing is the best outcome, so let's hope I'm wrong.

Mine too. I'm really hoping for the best - but I'm still on alert, expecting the worst. And trying hard not to cause it! :)

Cheers! (and sorry for answering this so late, I had a hellish week...)


NEWS FROM THE FRONT

Yeah, baby, finally a deliverable!

https://archive.org/details/KSP-Forum-Preservation-Project

This torrent has the following contents (at this time), apart from the informational and security boilerplate:

-r--r--r-- 1 lisias staff 941M May  7  2023 forum.kerbalspaceprogram.com-00000.warc.lrz
-r--r--r-- 1 lisias staff 413M May  7  2023 forum.kerbalspaceprogram.com-00001.warc.lrz
-r--r--r-- 1 lisias staff 504M Jul 28 09:52 forum.kerbalspaceprogram.com-202407.warc.lrz
-r--r--r-- 1 lisias staff  12M Jul 28 09:49 forum.kerbalspaceprogram.com-images-202407.warc.lrz
-r--r--r-- 1 lisias staff 1.1G Jul 27 09:32 forum.kerbalspaceprogram.com-media-202407.warc.lrz
-r--r--r-- 1 lisias staff 307K Jul 26 12:48 forum.kerbalspaceprogram.com-styles-202407.wrc.lrz
-r--r--r-- 1 lisias staff  24M Jul 28 21:41 redis.dump.json.lrz

For the sake of curiosity, here are the same files uncompressed:

-r--r--r-- 1 deck deck  41G May  7  2023 forum.kerbalspaceprogram.com-00000.warc
-r--r--r-- 1 deck deck  20G May  7  2023 forum.kerbalspaceprogram.com-00001.warc
-r--r--r-- 1 deck deck  23G Jul 28 09:52 forum.kerbalspaceprogram.com-202407.warc
-r--r--r-- 1 deck deck  19M Jul 28 09:49 forum.kerbalspaceprogram.com-images-202407.warc
-r--r--r-- 1 deck deck 1.2G Jul 27 09:32 forum.kerbalspaceprogram.com-media-202407.warc
-r--r--r-- 1 deck deck 1.7M Jul 26 12:48 forum.kerbalspaceprogram.com-styles-202407.warc
-r--r--r-- 1 deck deck 236M Jul 28 21:41 redis.dump.json

Except for images and movies, we get a 40 to 1 compression ratio using lrz - you just can't beat this. The Internet Archive's infrastructure costs thank you for your understanding! ;)

You will also find some minimal documentation, as well as the crypto boilerplate I'm using to guarantee integrity and origin (me):

-r--r--r-- 1 lisias staff 2.9K Jul 28 22:15 README.md
-r--r--r-- 1 lisias staff 1.5K Jul 28 20:14 allowed_signers
-r--r--r-- 1 lisias staff 2.9K Jul 28 20:27 allowed_signers.sig
-r--r--r-- 1 lisias staff 3.0K Jul 28 19:58 forum.kerbalspaceprogram.com-00000.warc.lrz.sig
-r--r--r-- 1 lisias staff 3.0K Jul 28 19:58 forum.kerbalspaceprogram.com-00001.warc.lrz.sig
-r--r--r-- 1 lisias staff 3.0K Jul 28 19:59 forum.kerbalspaceprogram.com-202407.warc.lrz.sig
-r--r--r-- 1 lisias staff 3.0K Jul 28 19:59 forum.kerbalspaceprogram.com-images-202407.warc.lrz.sig
-r--r--r-- 1 lisias staff 3.0K Jul 28 19:59 forum.kerbalspaceprogram.com-media-202407.warc.lrz.sig
-r--r--r-- 1 lisias staff 3.0K Jul 28 19:59 forum.kerbalspaceprogram.com-styles-202407.wrc.lrz.sig
-r--r--r-- 1 lisias staff 2.9K Jul 28 21:48 redis.dump.json.lrz.sig
-r-xr-xr-x 1 lisias staff  209 Jul 28 20:28 verify.sh

With openssh installed, run the verify.sh script and it will validate the files' integrity.
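If you'd rather run the check by hand, verify.sh presumably boils down to something like this (the signer identity and namespace are guesses - read the script itself for the real values):

# Verify each compressed WARC against its detached OpenSSH signature
for f in *.lrz; do
    ssh-keygen -Y verify -f allowed_signers -I lisias -n file -s "$f.sig" < "$f"
done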

What to do with these WARC files is up to you, but you will find a lot of information about them in the Project's repository:

https://github.com/net-lisias-ksp/KSP-Forum-Preservation-Project (the project issue tracker is active)

Unfortunately, proper documentation is - as usual - lacking, but I'm working on it:

  • You will find, at the very least, links to the tools I'm using and some hints about how to use them
    • as well as the configuration files I'm using for them.
  • You will also find the source code for the crawler there.
  • More to come

Now, some bad news:

I can't guarantee the WARC files above are up to date because, well, they aren't. Scraping the Forum is a race against the clock, and the clock always wins - by the time you scrape the last page, there's a lot of new content to be scraped again. So I'm not even trying at this moment.

The whole process, to tell you the truth, is terribly slow once you have the whole site in the database and you are just updating things. Even some pretty aggressive optimizations (such as caching the already-visited pages in the spider's memory and skipping them no matter what) didn't improve the situation to a point I would find comfortable. I'm currently studying how the deduplication works, in the hope of finding some opportunities for performance improvements.

Now, for the next steps:

  1. Documentation
    1. Proper documentation
  2. Setting up a prototype for a content server
    1. Including how to create a federation of trusted content mirrors to round-robin the requests, sharing the burden
  3. Cooking up something to allow collaborative scraping of this site
  4. Setting up a watchdog to monitor the site's health (from an external point of view), so we can determine the best times to scrape it without causing trouble.

 


Not bad for a newbie in this scraping business running a 3-week sprint on the thingy: one week probing tools, one setting them up, and one really working with them! :)

Cheers!


By the way... Never, ever update your email on Archive.org - you lose rights over the content you uploaded under the old email... :/

 


The more I browse the forums at the moment, the more I notice something: at some point Discord killed the ability to look at the pictures they host without some sort of key in the URL, which expires. Before that was the case, many people used Discord to host images in forum posts. I've found that for most of them, while clicking on them these days shows "This content is no longer available", they are still archived on archive.org. So could we perhaps write a bit of a post-processor for the data that replaces dead Discord links with ones pointing to the Web Archive?


29 minutes ago, HB Stratos said:

The more I browse the forums at the moment, the more I notice something: at some point Discord killed the ability to look at the pictures they host without some sort of key in the URL, which expires. Before that was the case, many people used Discord to host images in forum posts. I've found that for most of them, while clicking on them these days shows "This content is no longer available", they are still archived on archive.org. So could we perhaps write a bit of a post-processor for the data that replaces dead Discord links with ones pointing to the Web Archive?

Yes. And nice idea - I'm registering it on the issue tracker to avoid forgetting about it: https://github.com/net-lisias-ksp/KSP-Forum-Preservation-Project/issues/5

I want to stress that this latest deliverable is just the first one, and I'm pretty sure I made some mistakes - I tend to err on the safe side, so I probably missed something rather than archived something unwanted by accident. I'm especially concerned about copyright, which is the reason I split the content into 4 different WARC files, shielding the really important content.

I probably could have used some more time to polish it a bit, but - frankly - good enough is enough for an intermediate deliverable. I need more brains on this project, and I would not get them by trying to do everything right (and failing) by myself. Thanks for jumping in! :)

Which reminds me: WARC is a pretty simple, straightforward format; it's really easy (though terribly cumbersome in the extreme cases) to manipulate its contents using grep, sed, perl or even the old and faithful mcedit. So rest assured that we can forge WARC files to inject content into the archives. Obviously, this will be done in a 5th WARC file, to avoid tampering with the "official" ones. :)
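For instance, listing which URLs a WARC actually holds is a one-liner (file name taken from the listings above):

# Peek at the first few archived URLs in a WARC
grep -a '^WARC-Target-URI:' forum.kerbalspaceprogram.com-202407.warc | head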

There are really very few things we can't accomplish with this (somewhat convoluted, I admit) scraping scheme. The sky, bandwidth and disk space are the limits.


Sounds good. For now I (and ChatGPT, sorry) have slapped together a Tampermonkey user script that fixes all the dead Discord image links by replacing them with the version from archive.org. I have no idea what odd caveats this script has, as I've never written JS before, but it seems to be working. Use at your own risk.

// ==UserScript==
// @name         FixDeadDiscordImageLinks
// @namespace    http://tampermonkey.net/
// @version      2024-07-29
// @description  Replace Discord image links with archive.org versions on error as early as possible
// @author       HB Stratos
// @match        https://forum.kerbalspaceprogram.com/*
// @icon         data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
// @grant        none
// @run-at       document-start
// ==/UserScript==
(function() {
    'use strict';

    // we use timestamp 0 to make archive.org resolve it to the first existing archive, which is usually the intact one. 9999999999 would give the last one, which is usually the 404 error page
    const PREFIX = 'https://web.archive.org/web/0if_/'; // Desired prefix
    // Discord attachment URLs (anchored, so an already-rewritten src is not prefixed twice)
    const DISCORD_CDN = /^https:\/\/cdn\.discordapp\.com\/attachments\//;

    function handleImageError(event) {
        const img = event.target;

        if (DISCORD_CDN.test(img.src)) {
            // Point the broken image at the archive.org copy instead
            img.src = PREFIX + img.src;
        }
    }

    function monitorImages() {
        const images = document.querySelectorAll('img');

        images.forEach(img => {
            // Add error event listener to each image
            img.addEventListener('error', handleImageError);
        });
    }

    // Monitor images as soon as the script is injected
    monitorImages();

    // Also re-monitor images when new elements are added to the DOM
    const observer = new MutationObserver(mutations => {
        mutations.forEach(mutation => {
            mutation.addedNodes.forEach(node => {
                if (node.tagName === 'IMG') {
                    node.addEventListener('error', handleImageError);
                } else if (node.querySelectorAll) {
                    node.querySelectorAll('img').forEach(img => {
                        img.addEventListener('error', handleImageError);
                    });
                }
            });
        });
    });

    observer.observe(document, { childList: true, subtree: true });
})();

 


1 hour ago, HB Stratos said:

Sounds good. For now I (and ChatGPT, sorry) have slapped together a Tampermonkey user script that fixes all the dead Discord image links by replacing them with the version from archive.org. I have no idea what odd caveats this script has, as I've never written JS before, but it seems to be working. Use at your own risk.

Things can be way simpler with pywb. Just add the Internet Archive as a Collection in your proxy, and it will hit IA for anything missing in your Collections.

As a matter of fact, the hard part is to hit IA only for resources sourced from the Forum - that thing is pretty broad.

Ideally, the pywb proxy should try to hit a source under this scheme:

  1. Check if it exists in the local WARCs. If yes, serve it and finish the request.
  2. Check if it exists in the live web. If yes, serve it and finish the request.
  3. Check if it exists in the IA collection. If yes, serve it and finish the request.
  4. throw a 404.

I updated the issue.


My solution here is not intended for the archive. It's effectively a Chrome extension which fixes up the current, existing website - or, if the @match is expanded, any other website with this issue. At runtime it detects every 404 from Discord and switches the link to an archive.org one.


2 hours ago, HB Stratos said:

My solution here is not intended for the archive. It's effectively a Chrome extension which fixes up the current, existing website - or, if the @match is expanded, any other website with this issue. At runtime it detects every 404 from Discord and switches the link to an archive.org one.

Oh, now I see! :)

I gave it a peek anyway. I looked in "my" WARCs for something from Discord that wouldn't be working anymore.

Found this:

https://cdn.discordapp.com/attachments/198576615658094592/235746983732707328/unknown.png

This link, currently, leads to a 404 page from discord:

This content is no longer available.

Your code would change it to:

https://web.archive.org/web/0if_/https://cdn.discordapp.com/attachments/198576615658094592/235746983732707328/unknown.png

And I.A. would rewrite it to:

https://web.archive.org/web/20240616102212if_/https://cdn.discordapp.com/attachments/198576615658094592/235746983732707328/unknown.png

But this link returns the same "This content is no longer available" error message, because IA revisited the link after it had already expired. There's no earlier capture for this link, so that specific content is lost for good now.

For content that IA fetched successfully at least once, like this one (IA fetched it in the past, but on revisiting it got a 404):

https://cdn.discordapp.com/attachments/252199919316631553/261800273071177728/image.jpg

Your plugin should work fine.

Perhaps the script should fail gracefully when IA returns a 404? The rationale is to show the user that the extension is working, and that the problem is IA not having the image either. The way it works now, people will come back saying the extension is not working!

 

 


On 7/25/2024 at 9:41 PM, Lisias said:

whoever owns the IP owns the images and styles, but not the textual contents. Posts on the Forum are almost unrestrictedly and perpetually licensed to the Forum's owner, but they still belong to the original poster. So whoever owns the IP, at least theoretically, has no legal grounds to take down that content

Firstly, thank you for all of your preservation efforts!

This isn't strictly preservation-related, but you're probably a lot better versed in the legalese than I am: when you say "images and styles", does that refer to, like, the background and banners of the forum, or to pictures in posts?

And you're saying, for sure, anything posted here belongs to the poster?

I was under the impression that T2 legally owned any content posted here for some reason (probably the panic back when the old EULA change took effect). If this is not in fact the case, and I actually own all the stories and the images used in the stories that I've made, this is a really big weight off my shoulders.

