16.6 FSx for Lustre: High-Performance Parallel File System for HPC and ML
Right, so you need to go fast. Not “my-internet-is-out-and-I’m-trying-to-watch-a-video” fast. We’re talking about the kind of speed that makes physicists nervous. You’re probably here because you’re dabbling in high-performance computing (HPC), machine learning (ML) on a massive dataset, or maybe you’re just a performance junkie. Welcome. FSx for Lustre is your new best friend, a fully managed parallel file system that Amazon basically yanked out of a supercomputing center and shoved into a data center rack for you. It’s obscenely fast, and it’s built for the specific use case where many computers need to read and write to the same storage at the same time without tripping over each other.
The magic word here is parallel. A traditional file system (even a fancy one like FSx for Windows File Server) is like a single-lane road to a library. One car (server) goes in, gets a book (file), and comes out. FSx for Lustre is that same library, but with a thousand doors and a conveyor belt system so a thousand people can grab a thousand different books simultaneously. This is called a parallel file system, and it’s how you feed data-hungry applications like large-scale simulations or model training on petabytes of image data without creating a brutal I/O bottleneck.
Deployment: Scratch, Persistent, and What the Heck is an MDT?
When you fire up an FSx for Lustre file system, you’ve got two main flavors to choose from, and this choice is critical. Get it wrong, and you’ll either waste a ton of money or lose your data. No pressure.
First, there’s SCRATCH. This is the pedal-to-the-metal option for short-lived workloads where the data is ephemeral. Think of it as the RAM of the file system world. It’s cheaper per GB, but the data is not replicated. If a file server fails, whatever lived on it can be gone. The name is your clue: it’s for scratch work. You use it, you get your results, and you ship those results off to somewhere durable like S3. It has two sub-types: SCRATCH_1 and the newer SCRATCH_2, which offers higher burst capabilities. If your HPC cluster is running a job you could re-run from inputs that already sit in S3, Scratch is your go-to. For a two-week simulation, either checkpoint to S3 along the way or think hard about Persistent.
Then there’s PERSISTENT. This is the durable, long-haul option. Data is replicated, but within a single Availability Zone: FSx for Lustre does not have a Multi-AZ deployment option. If a file server fails, AWS replaces it automatically and your data survives, though an outage of the whole AZ still takes the file system offline. It is not the slow option either, since Persistent lets you pay for higher throughput per TiB than Scratch’s baseline. You pay more for this privilege. This is for workloads where the file system itself is a long-term asset, like an active ML feature store that’s constantly being updated.
Under the hood, Lustre splits the work between Metadata Targets (MDTs) and Object Storage Targets (OSTs). The MDT is the librarian: it knows where every file is. The OSTs are the shelves where the actual data is stored, and a large file is striped across many of them, which is where the parallel throughput comes from. Metadata is the usual choke point. With limited metadata capacity, metadata-heavy operations (like creating millions of small files) queue up at the librarian no matter how many shelves you have. Newer Persistent (PERSISTENT_2) file systems let you provision metadata performance separately, so check what your deployment type offers before you commit. Choose wisely.
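To make the librarian-and-shelves picture a bit more concrete, here is a minimal sketch of how the bytes of one striped file could map onto OSTs. It assumes plain round-robin (RAID-0 style) striping with a fixed stripe size. The function name and the example stripe settings are made up for illustration and are not something FSx exposes under these names.

```python
def ost_for_offset(offset: int, stripe_size: int, stripe_count: int) -> int:
    """Return which OST (0-based, within this file's stripe set) holds byte `offset`.

    Assumes simple round-robin striping: chunk 0 goes to OST 0, chunk 1 to OST 1,
    and so on, wrapping around after `stripe_count` chunks.
    """
    if offset < 0 or stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("offset must be >= 0; stripe_size and stripe_count must be > 0")
    chunk_index = offset // stripe_size
    return chunk_index % stripe_count


MiB = 1024 * 1024

# Hypothetical layout: a file striped across 4 OSTs in 1 MiB chunks.
layout = [ost_for_offset(i * MiB, MiB, 4) for i in range(6)]
print(layout)  # [0, 1, 2, 3, 0, 1]
```

The takeaway is that four clients reading four consecutive chunks hit four different shelves at once. On a mounted file system, `lfs getstripe` and `lfs setstripe` are the Lustre tools for inspecting and changing a file's real layout.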
Here’s how you deploy a basic Persistent file system with the AWS CLI. Notice I’m linking it directly to an S3 bucket because that’s its best friend.
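# Create a security group for Lustre traffic (TCP 988 and 1018-1023)
aws ec2 create-security-group --group-name LustreSG --description "For FSx Lustre" --vpc-id vpc-12345678
# Let members of the group reach each other on the Lustre ports
aws ec2 authorize-security-group-ingress --group-id sg-12345678 --protocol tcp --port 988 --source-group sg-12345678
aws ec2 authorize-security-group-ingress --group-id sg-12345678 --protocol tcp --port 1018-1023 --source-group sg-12345678
# Create the file system, linked to an existing S3 bucket for easy data ingestion.
# PERSISTENT_1 accepts 50, 100, or 200 MB/s/TiB, and ExportPath must live in the same bucket as ImportPath.
# The shorthand for --lustre-configuration is one comma-separated string.
aws fsx create-file-system \
  --file-system-type LUSTRE \
  --storage-capacity 1200 \
  --subnet-ids subnet-12345678 \
  --security-group-ids sg-12345678 \
  --lustre-configuration "DeploymentType=PERSISTENT_1,PerUnitStorageThroughput=200,DataCompressionType=LZ4,ImportPath=s3://my-source-data-bucket/input-dir,ExportPath=s3://my-source-data-bucket/output-dir"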
The S3 Symbiosis: Your Data Conveyor Belt
This is arguably Lustre’s killer feature. Notice the ImportPath and ExportPath in the command above? You can point your Lustre file system directly at an S3 bucket. This does two brilliant things:
- Data Ingestion: When you create the file system with an ImportPath, it loads the listing of your S3 objects into Lustre as file metadata, so every object shows up as a file right away. The contents are pulled in lazily the first time something reads a file, and many clients can trigger those loads in parallel. If you’d rather not pay the first-read penalty in the middle of a job, preload the files ahead of time (lfs hsm_restore is the tool for that).
- Data Repository Associations (DRA): This is the ongoing magic. You can configure automatic import and automatic export policies (AutoImportPolicy and AutoExportPolicy in the API) that sync changes between Lustre and S3. Create a file in Lustre? It can automatically appear in S3. Delete something in S3? It can be automatically removed from Lustre. This is your built-in data pipeline, and it eliminates the need for clunky aws s3 sync commands.
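As a rough illustration of what a DRA looks like in code, here is a minimal sketch using boto3's `create_data_repository_association`. The file system ID, the `/input` directory, and the helper names are placeholders. It also assumes your file system's deployment type supports DRAs and that it was not already linked through the older ImportPath setting, so treat this as a starting point and not a recipe.

```python
def build_dra_request(file_system_id: str, fs_path: str, s3_path: str) -> dict:
    """Build keyword arguments for the FSx CreateDataRepositoryAssociation call."""
    return {
        "FileSystemId": file_system_id,
        "FileSystemPath": fs_path,            # directory inside Lustre, e.g. /input
        "DataRepositoryPath": s3_path,        # s3://bucket/prefix to link
        "BatchImportMetaDataOnCreate": True,  # load the S3 listing as metadata up front
        "S3": {
            # Sync new, changed, and deleted objects from S3 into Lustre...
            "AutoImportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]},
            # ...and new, changed, and deleted files from Lustre back to S3.
            "AutoExportPolicy": {"Events": ["NEW", "CHANGED", "DELETED"]},
        },
    }


def create_dra(request: dict) -> dict:
    """Send the request to AWS. Needs boto3 and valid credentials."""
    import boto3  # imported lazily so build_dra_request works without AWS access

    return boto3.client("fsx").create_data_repository_association(**request)


# Placeholder identifiers, matching the ones used elsewhere in this section.
request = build_dra_request(
    "fs-12345678", "/input", "s3://my-source-data-bucket/input-dir"
)
print(request["S3"]["AutoImportPolicy"])
```

Keeping the request builder separate from the API call means you can review or unit-test the policy before anything touches your account.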
The key thing to remember is the latency. S3 is object storage with relatively high per-request latency; Lustre is a parallel file system with sub-millisecond latency. You use S3 as your cheap, durable archive. You use Lustre as your high-performance workbench. The DRA is the robotic arm that moves data between the archive and the workbench.
Performance: It’s All About Throughput and IOPS
You’re not using this for your text documents. You care about performance. FSx for Lustre performance is primarily determined by two things you choose at creation:
- Storage Capacity: This one is straightforward. The more GB you provision, the higher the baseline throughput. It’s a linear scale.
- Throughput per Unit (MB/s/TiB): This is the big lever. For Persistent you pick a tier: 50, 100, or 200 on PERSISTENT_1, and 125, 250, 500, or 1000 on PERSISTENT_2. This number multiplied by your storage capacity gives you your file system’s baseline throughput. A 1.2 TiB file system with 200 MB/s/TiB gives you a baseline of 240 MB/s. That sounds modest, but it scales linearly: 12 TiB at the same tier is 2400 MB/s, or 2.4 GB per second.
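The baseline arithmetic is simple enough to sanity-check in a few lines. This is a minimal sketch with illustrative function names. It follows the loose convention of treating 1200 GiB as 1.2 TiB, and the 1 TB dataset is a made-up example.

```python
def baseline_throughput_mbps(capacity_tib: float, per_unit_mbps: int) -> float:
    """Baseline throughput in MB/s = storage capacity (TiB) x throughput per TiB."""
    return capacity_tib * per_unit_mbps


def minutes_to_read(dataset_gb: float, throughput_mbps: float) -> float:
    """Minutes needed to stream `dataset_gb` once at a steady `throughput_mbps`."""
    return dataset_gb * 1000 / throughput_mbps / 60


small = baseline_throughput_mbps(1.2, 200)
large = baseline_throughput_mbps(10, 250)

print(round(small))  # 240
print(round(large))  # 2500

# Hypothetical 1 TB dataset streamed once at the small file system's baseline.
print(round(minutes_to_read(1000, small), 1))  # 69.4
```

Running the numbers before you provision is a cheap way to find out whether your cluster will be waiting on storage or the other way around.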
But wait, there’s more! Both Scratch and Persistent types can burst beyond their baseline using credits. It’s like CPU bursting on a burstable EC2 instance: you bank credits while you run below baseline and spend them when you run above it. If you have a short, intense workload on a large file system, you can achieve speeds upwards of tens of GB/s. SCRATCH_2 in particular can burst to several times its baseline, which is why it feels so fast for short jobs.
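If the credit idea feels abstract, a toy token-bucket simulation shows the shape of it. This is only a sketch of the general burst-credit technique: every number below is invented, and AWS does not publish its credit accounting in this form, so do not read the output as a prediction for a real file system.

```python
def simulate_burst(baseline: int, burst: int, credits: int,
                   demand: int, seconds: int) -> list:
    """Return the throughput (MB/s) delivered in each second.

    Running above `baseline` spends credits (1 credit = 1 MB of extra transfer).
    Running below it earns them back, up to the starting balance.
    """
    max_credits = credits
    delivered = []
    for _ in range(seconds):
        rate = min(demand, burst)
        if rate > baseline:
            extra = rate - baseline
            if credits < extra:
                # Not enough credits left: spend what remains, then sit at baseline.
                extra = credits
                rate = baseline + extra
            credits -= extra
        else:
            credits = min(max_credits, credits + (baseline - rate))
        delivered.append(rate)
    return delivered


# Hypothetical file system: 240 MB/s baseline, 1200 MB/s burst, 2880 MB of credits.
print(simulate_burst(240, 1200, 2880, demand=1200, seconds=5))
# [1200, 1200, 1200, 240, 240]
```

Three seconds of glory, then the wall. The real system works on a much longer time scale, but the cliff when credits run out has the same shape.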
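Here’s how you check what you actually provisioned, so you know where your baseline sits:
aws fsx describe-file-systems --file-system-ids fs-12345678
Look in the response for StorageCapacity and, under LustreConfiguration, the DeploymentType and (for Persistent) PerUnitStorageThroughput. Multiply capacity by throughput per TiB and you have your baseline. The response does not report a burst credit balance, so the practical way to see trouble coming is to watch the file system’s CloudWatch throughput metrics (DataReadBytes and DataWriteBytes in the AWS/FSx namespace) against that baseline. Keep an eye on this. If you run above baseline long enough to deplete your credits, you’ll fall back to your baseline speed, which might cause your 10,000-core cluster to suddenly start tapping its foot impatiently.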
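One way to keep an eye on how hard you are pushing the file system is to turn CloudWatch byte counts into MB/s and compare that with your baseline. The sketch below assumes the `DataReadBytes` and `DataWriteBytes` metrics in the `AWS/FSx` namespace, keyed by `FileSystemId`. The helper names and the sample byte counts are invented for illustration.

```python
from datetime import datetime, timedelta, timezone


def observed_mbps(read_bytes: int, write_bytes: int, period_seconds: int) -> float:
    """Convert byte totals over a period into average throughput in MB/s."""
    return (read_bytes + write_bytes) / period_seconds / 1_000_000


def fetch_sum(file_system_id: str, metric_name: str, minutes: int = 5) -> float:
    """Sum a CloudWatch metric over the last few minutes. Needs boto3 and credentials."""
    import boto3  # imported lazily so observed_mbps works without AWS access

    end = datetime.now(timezone.utc)
    stats = boto3.client("cloudwatch").get_metric_statistics(
        Namespace="AWS/FSx",
        MetricName=metric_name,
        Dimensions=[{"Name": "FileSystemId", "Value": file_system_id}],
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=60,
        Statistics=["Sum"],
    )
    return sum(point["Sum"] for point in stats["Datapoints"])


# Made-up sample: 9 GB read and 5.4 GB written during one 60-second period.
mbps = observed_mbps(9_000_000_000, 5_400_000_000, 60)
print(mbps, mbps / 240)  # 240.0 1.0
```

A ratio that sits well above 1.0 for long stretches means you are living on burst, and that is your cue to provision more capacity or a higher throughput tier.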
The Gotchas: Because Nothing is Perfect
I wouldn’t be your brilliant friend if I didn’t warn you about the quirks.
- Small Files are the Enemy: Lustre is built for large, sequential operations. If your workload involves writing millions of tiny files, you will hammer the MDT (the librarian) and performance will tank. If you must do this, pack the small files into larger archives or shards where you can, and look at a PERSISTENT_2 file system with extra metadata performance provisioned. There is no Multi-AZ option to reach for here.
- Backups Come with Fine Print: Scratch file systems have no backups at all. Persistent file systems support automatic daily backups and user-initiated backups, but only when the file system is not linked to an S3 data repository. If yours is linked to S3 (and it usually should be), your backup strategy is to export your important results back to S3, and then use S3 versioning and lifecycle policies. Don’t learn this one the hard way.
- Cost: This is not cheap. You’re paying for provisioned capacity and throughput 24/7. If you spin up a 10 TiB Persistent file system with 250 MB/s/TiB, you’re paying for 2.5 GB/s of throughput even if it’s sitting idle at 3 a.m. on a Sunday. There is no stop button, so delete it when you’re not using it! Export your results to S3 first, then use a Lambda function to automatically create and delete the file system as part of your compute job lifecycle. Treating it like a permanent file system is the fastest way to get a heart-stopping bill.
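To put a number on the idle-at-3-a.m. problem, here is a back-of-the-envelope sketch. The price per GB-month is a made-up placeholder, not a real AWS rate, and the sketch assumes charges are prorated by how long the file system exists. Plug in the current rate for your region and throughput tier.

```python
HOURS_PER_MONTH = 730  # average month length used for prorating


def monthly_cost(capacity_gb: float, price_per_gb_month: float, hours_alive: float) -> float:
    """Storage cost for a file system that exists for `hours_alive` hours in a month."""
    return capacity_gb * price_per_gb_month * hours_alive / HOURS_PER_MONTH


PLACEHOLDER_PRICE = 0.20  # hypothetical $/GB-month, for illustration only

always_on = monthly_cost(10_000, PLACEHOLDER_PRICE, HOURS_PER_MONTH)
job_scoped = monthly_cost(10_000, PLACEHOLDER_PRICE, 8 * 20)  # 8 hours a day, 20 days

print(round(always_on))   # 2000
print(round(job_scoped))  # 438
```

Even with an invented price, the ratio is the point: a file system that only exists while jobs run costs a fraction of one that idles all month.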
So, use FSx for Lustre when your problem is fundamentally about parallel I/O speed. Integrate it tightly with S3. Choose Scratch for the disposable, ultra-fast jobs and Persistent for the ones you can’t afford to re-run. And for heaven’s sake, mind your burst credits. Now go make that cluster sweat.