Announcing transarchive.eu
The TransArchive project aims to prevent the erasure of transgender and LGBTQ+ history, scientific data and mutual aid by cataloguing and archiving online data and holding it outside jurisdictions that might seek to eradicate it.

I've briefly mentioned this in a few places, but I wanted to make a bigger announcement now that https://transarchive.eu has gone live. It's still very much in alpha at the moment, but it's moving quickly toward being a real thing, so it's time to talk about it.
So what is it, and why did I create it?
The rise of the extreme right wing – fascism by any other name – across the world is extremely concerning, particularly when you happen to be a member of the minority that they are most fond of harming. One of the scariest things for me is seeing information about us and created by us being systematically destroyed – this was a very bad sign in the 1930s, and it's a very bad sign now.
The loss of data and scientific work is fundamental. Without it, it is harder to defend against entirely bogus appeals to 'common sense' or 'science' – if the contrary view is missing, it's easier to warp reality to support draconian policies. It's an old saw that reality has a left-wing bias – a simplification, of course, but travesties like the UK Cass Report and the US's more recent equivalent, based mostly on wishful thinking, bad science and made-up bullshit, are difficult to unseat once they become entrenched. If the contrary (real, objective) data are simply deleted, we face decades of struggle.
Secondly, and equally importantly, access to information and support will potentially get more difficult. We are already seeing large internet companies start to capitulate, so the information erasure may not be confined to governmental or government-funded sites, and may extend across social media. What happens if our support groups are shut down, or at least information about them is embargoed? What happens when popular and essential information sources online are similarly taken down? Access to gender-affirming care is becoming increasingly difficult in many countries (and is already nearly nonexistent in others, but that's another rant), so this information is life or death for many of us.
My solution isn't perfect, and honestly isn't a solution – only social change reversing the fascism would do that. But what I am attempting to do is archive as much information as possible, primarily trans-related but potentially also anything LGBTQ-related in the wider sense, as well as more mainstream resources that happen to be useful. I'm making this available via the web site at transarchive.eu.
At the time of writing, this is all pretty alpha. The project is only something like 3 weeks old at this point, so having something – anything! – running is a lot better than nothing; I really wanted to start the archival process as soon as humanly possible. Practically, the intention is that the site will serve partly as a useful index and jumping-off point for all things trans-related, whilst also archiving the underlying data on a best-efforts basis. Think of it as a trans-specific Internet Archive or Wayback Machine.
How does it work?
I'm going to get a little technical here. No apologies – this might be useful to someone at some point.
The site is split into a front end, which is the thing that appears in the browser, and a back end that runs on servers on the internet side. Most of the site's functionality – the things you click, scroll, interact with, etc. – runs entirely inside your browser or phone, which is why the site is relatively fast. The technology used here is Svelte 5, a modern JavaScript framework that is relatively quick to develop with and (importantly) runs very fast with relatively little server overhead. The code is written from scratch. Metadata – the list of sites, their categories, titles, descriptions, settings, the time they were last updated, etc. – is all stored in a PostgreSQL database. The site data itself is stored as files in a filesystem.
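To make that split concrete, here's a minimal sketch of what such a metadata table might look like. The actual schema isn't published anywhere, so the table and column names below are illustrative assumptions, not the real transarchive.eu schema.

```python
# Hypothetical metadata table for archived sites; all names are assumptions.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS sites (
    id           SERIAL PRIMARY KEY,
    url          TEXT NOT NULL UNIQUE,
    title        TEXT,
    description  TEXT,
    category     TEXT,
    settings     JSONB DEFAULT '{}'::jsonb,
    archive_path TEXT,          -- where the mirrored files live on disk
    last_crawled TIMESTAMPTZ,
    updated_at   TIMESTAMPTZ DEFAULT now()
);
"""

def init_schema(dsn: str) -> None:
    """Create the metadata table if it doesn't already exist."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(DDL)
```

The mirrored pages themselves never go into the database; a row just records where on the filesystem they live.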
The server-side functionality is split into a few pieces:
Site server. This is a Svelte 5/SvelteKit back end that serves the site seen in people's browsers. It also provides the API that gives the site access to data and the ability to make changes for moderation purposes.
NAS. There are actually two NASes, a primary and a backup. They store the archived site data, along with database and system backups.
nginx web server. This reverse-proxies the SvelteKit server and also serves the archived sites from an NFS share on the NAS, making everything visible under a single domain name. This avoids the need for archive data to be routed through SvelteKit.
PostgreSQL server. Nothing much to say here – it's straightforward and does what you'd guess it does.
Web Crawler. This is implemented in Python (my own code), using httrack to perform the actual mirroring. I may replace this later, but it was the quickest way to get something going on a quick-and-dirty-but-actually-working basis. There's a rough sketch of this approach just after this list.
Automod. The first iteration of the site's design omitted this piece because I naively thought that manual curation/moderation would be sufficient. I thought maybe we'd be mirroring a few hundred sites at most. I was off by a couple of orders of magnitude. Back-of-the-envelope calculations showed that I'd need to recruit literally hundreds of human moderators, each donating an hour or two of their time weekly, to stand even a vague chance of keeping up with the crawler. For a near-zero-budget volunteer effort that absolutely needs doing NOW, this wasn't going to work. I had to build automated sentiment/quality analysis to evaluate sites (most links are bad, broken, spammy, or even harmful). It also turned out to be very tedious and slow to manually set up categories, titles and descriptions, so automated text summarization was the only way I could make it work. None of this is perfect – only 3 weeks into the project, remember! – but it's working. It isn't a replacement for human moderation or curation, but it changes the job from painfully doing everything from scratch to just going in and correcting mistakes. There's a rough sketch of this kind of triage after this list, and I intend to write more about how it really works and the ethical issues involved in a later post.
Front end reverse proxy. This is kind of like having my own Cloudflare-like capability: the site looks like it's in East Germany when it is actually elsewhere. I've written a bit more about this in another post.
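For the curious, here's roughly what the crawler side can look like. The real crawler's queue handling and httrack options aren't described here, so treat this as a minimal sketch under those assumptions rather than the actual code:

```python
# Minimal crawler-worker sketch: hand a URL to the httrack CLI and mirror it
# into its own directory. Flags and directory layout are illustrative assumptions.
import subprocess
from pathlib import Path

def mirror_site(url: str, dest_root: Path) -> bool:
    """Mirror one site into a dedicated directory; return True on success."""
    dest = dest_root / url.replace("://", "_").replace("/", "_")
    dest.mkdir(parents=True, exist_ok=True)
    try:
        result = subprocess.run(
            ["httrack", url, "-O", str(dest)],  # -O sets httrack's output path
            capture_output=True,
            text=True,
            timeout=3600,  # don't let one stuck site block the whole queue
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

And the Automod idea, in spirit, is the sort of thing below. The models and thresholds the project actually uses aren't public (that's for the later post), so the Hugging Face pipelines here are stand-ins for illustration only:

```python
# Rough triage sketch: a crude sentiment/quality signal plus a draft summary
# for a human moderator to correct. Models and thresholds are placeholders.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # stand-in quality/sentiment model
summarizer = pipeline("summarization")       # stand-in description generator

def triage(page_text: str) -> dict:
    """Return a draft verdict and description for human review."""
    snippet = page_text[:2000]  # keep input within typical model limits
    sentiment = classifier(snippet)[0]
    summary = summarizer(snippet, max_length=60, min_length=15)[0]["summary_text"]
    return {
        "flagged": sentiment["label"] == "NEGATIVE" and sentiment["score"] > 0.9,
        "draft_description": summary,
    }
```

Either way, the output is a suggestion for a human to review, not a final decision.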
Security and Privacy
This isn't quite a traditional security and privacy problem. Normally there is a wealth of inherently private user data that must be protected from external interference or eavesdropping. Here, the information we collect and disseminate is already in the public domain, in the sense that it is out there on the internet. Practically, we're not doing anything much different from a search engine in this respect, so the site data and metadata aren't particularly sensitive. We do, of course, serve everything end-to-end via TLS with proper certificates. The API that the web site uses to talk to the server is protected with salted HMAC authentication using shared secrets, inside the encrypted TLS connections, but this is only really relevant if a human moderator is signed in; normal users don't need to sign in at all. The protocol is fully stateless, so cookies are not used or needed and nothing needs to be stored in the browser (there's a sketch of the idea below). Yes, this goes a bit beyond standard Svelte/SvelteKit architecture, but I felt it was important. We don't need to track individual users or store information about them, so we simply don't. The only thing we do keep (for a minimal period) is server-side logs for security and debugging purposes – this is necessary to protect against certain kinds of attack.
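To give a flavour of what that stateless authentication looks like, here's a minimal sketch. The real API's header names, payload format and salt handling aren't published, so everything below is an illustrative assumption rather than a description of the live protocol:

```python
# Sketch of stateless, salted-HMAC request signing. Header names, message
# layout and secret provisioning are hypothetical, not the real API.
import hmac
import hashlib
import secrets
import time

SHARED_SECRET = b"moderator-shared-secret"  # provisioned out of band, never sent

def sign_request(method: str, path: str, body: bytes) -> dict:
    """Client side: build auth headers for one request; no cookies, no session."""
    salt = secrets.token_hex(16)
    timestamp = str(int(time.time()))
    message = "\n".join([method, path, salt, timestamp]).encode() + body
    digest = hmac.new(SHARED_SECRET, message, hashlib.sha256).hexdigest()
    return {"X-Auth-Salt": salt, "X-Auth-Timestamp": timestamp, "X-Auth-Signature": digest}

def verify_request(method: str, path: str, body: bytes, headers: dict) -> bool:
    """Server side: recompute the HMAC and compare in constant time."""
    message = "\n".join([method, path, headers["X-Auth-Salt"],
                         headers["X-Auth-Timestamp"]]).encode() + body
    expected = hmac.new(SHARED_SECRET, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, headers["X-Auth-Signature"])
```

Because every request carries its own signature and the shared secret is provisioned out of band, there's no session state to keep on either side – hence no cookies.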
From a user's point of view, we don't know much, if anything, about them, and we don't want to. From a moderator's point of view, we have their login details, but that's pretty much it. User data is therefore about as minimal as it could possibly be for a site that actually has any nontrivial functionality. That said, users' own browser histories, or any information about their communications traffic collected by their internet service provider, may still be dangerous for anyone who could be put at risk by being seen to access LGBTQ+ information. For those situations we recommend viewing the site via Tor. It's not in place at the time of writing, but we intend to publish an onion endpoint so that nothing resembling our domain name needs to appear anywhere that could give a user away.
Where it's at now, and where we want it to get to
At the time of writing it is very early in the project's timeline, so things are admittedly rough around the edges. Not everything works perfectly. There is no N+1 redundancy, let alone N+2. Data quality is a bit iffy in places – we have some duplicates, and sometimes things went really wonky with earlier versions of the crawler, so some of that data needs to be weeded out. It'll happen. The infrastructure isn't completely there yet – I need to add some hard drives to the NASes to cope with a likely large amount of incoming site data. There's currently no GPU acceleration for the Automod subsystem; that's waiting on me getting back home a couple of weeks from now so I can upgrade the hypervisor on one of the servers and get its GPU passthrough working (the VM is already set up, but it's currently stuck using the CPU, which means the server is making a noise like a jet engine – the power is mostly solar so that's not a particular problem, but it's slower than I'd like right now). The web front end 'worked fine on my machine' but is now showing its little flakies as I prod it from other systems. I also need to put together a dev/staging/prod pipeline with proper rollouts and seamless deployment, or SREs will point and laugh. As ever, the last 20% of the effort always takes 142000% of the time. But so it goes!
Search
The biggest missing piece right now is a search engine. Since we will have a substantial collection of relevant information, it would be really nice if all of it were searchable. Of course, I'd like this not to suck, so I want to use modern approaches – text embeddings and vector search, the genuinely good part to come out of AI research around search; I have no particular appetite for using LLMs for this. Though it's tempting to just use an off-the-shelf text search solution, I've got a fair bit of background in that area, so I'm basically unable to stop myself from trying to do it better!
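For illustration, the embeddings-and-vector-search idea boils down to something like the sketch below. Nothing has actually been chosen yet, so the sentence-transformers model and the brute-force NumPy index are placeholders, not the eventual design:

```python
# Sketch of embedding-based search over archived page text. Model choice and
# index storage are placeholder assumptions for the sake of the example.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, CPU-friendly embedding model

def build_index(documents: list[str]) -> np.ndarray:
    """Embed every document once; normalise so dot product == cosine similarity."""
    vectors = model.encode(documents, normalize_embeddings=True)
    return np.asarray(vectors)

def search(query: str, index: np.ndarray, documents: list[str], k: int = 5) -> list[str]:
    """Return the k documents whose embeddings are closest to the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = index @ q
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]
```

In practice the index would live in something purpose-built (pgvector, for example, given PostgreSQL is already in the stack), but the principle is the same: embed each page once at archive time, embed the query at search time, and rank by similarity.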
Open Source
Everything we're using is open source and there are no paid licenses underlying anything, so it remains feasible to release our code at some point. However, it's currently a typical experimental hacky dog's breakfast, so I will want to do a lot of cleaning up and straightening out before it gets inflicted on anyone else.
The pluses and minuses of open-sourcing our code are tricky. I'd like to do it, if for no other reason than getting help with the development side. I can also see definite use cases outside the trans/broader queer community – for example, helping to prevent the erasure of BIPOC histories, data and mutual assistance. I'd like to see that happen, and may be willing to host an instance myself (though I'm very aware that would probably mean going up a couple of orders of magnitude in data size and bandwidth needs).
And finally...
Yes, the chess piece in the top right is a queen.