We recently wanted to remove an Amazon S3 bucket holding more than 1,000,000 files. The bucket also had versioning enabled, which means the actual number of stored objects was far larger. To give an idea of the scale, we dumped the list of file paths to delete: the resulting text file was 500 MB.

This task, which seems simple at first, proved quite complicated to handle, mostly because of Amazon's own limitations, which it would be nice to see addressed.

The first thing we had to do was to disable versioning on the bucket in the Amazon Web Services console (S3 calls this suspending versioning).

Without this, not only would the bucket not be emptied, but new delete markers would also be added to it, which would make our life even harder.
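
For readers who prefer the command line to the console, versioning can also be switched off (suspended, in S3 terms) with the AWS CLI's put-bucket-versioning command. The Ruby sketch below only builds the argument list and prints it. The helper name suspend_versioning_command and the bucket name my-example-bucket are ours, chosen for illustration, and were not part of our original workflow.

```ruby
# Builds the AWS CLI invocation that suspends versioning on a bucket.
# `suspend_versioning_command` is a hypothetical helper name.
def suspend_versioning_command(bucket_name)
  [
    'aws', 's3api', 'put-bucket-versioning',
    '--bucket', bucket_name,
    '--versioning-configuration', 'Status=Suspended'
  ]
end

# 'my-example-bucket' is a placeholder bucket name.
command = suspend_versioning_command('my-example-bucket')
puts command.join(' ')
```

Passing the array to system(*command) would run it without going through a shell, which avoids quoting issues.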

The first assumption a user makes when wanting to delete an S3 bucket is that clicking on Delete Bucket works. But Amazon does not allow deleting a non-empty bucket.

Emptying the bucket through the Amazon Console does not work either when the bucket contains more than 10,000 files. And this is where the trouble begins: simply listing the files to delete ends up crashing the most popular S3 tools, such as s3cmd.

We found some really interesting scripts designed to delete both the delete markers and all file versions in an S3 bucket. These scripts did delete files from our bucket, but they were still running after four days.
The main reason is that they issue one request per file deletion. We needed to perform bulk deletes instead.

The AWS CLI can delete up to 1,000 objects in a single HTTP request via the s3api delete-objects command.
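
To make the batching idea concrete, here is a minimal Ruby sketch that groups key and version ID pairs into JSON payloads of the shape delete-objects expects for its --delete argument. The helper name delete_payloads and the sample keys are made up for illustration. It shows the principle only, not the full script.

```ruby
require 'json'

# delete-objects accepts at most 1,000 objects per request.
MAX_OBJECTS_PER_REQUEST = 1000

# Turns [key, version_id] pairs into JSON payloads, one per request.
# `delete_payloads` is a hypothetical helper name.
def delete_payloads(pairs, batch_size = MAX_OBJECTS_PER_REQUEST)
  pairs.each_slice(batch_size).map do |batch|
    objects = batch.map { |key, version_id| { 'Key' => key, 'VersionId' => version_id } }
    payload = { 'Objects' => objects, 'Quiet' => true }
    JSON.generate(payload)
  end
end

# Made-up sample data: 2,500 object versions.
pairs = Array.new(2500) { |i| ["logs/file-#{i}.txt", "version-#{i}"] }
payloads = delete_payloads(pairs)
puts payloads.size # 3 requests instead of 2,500
```

With one request per object, a million versions mean a million round trips. With batches of 1,000, the same work takes about a thousand.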

We wrote a Ruby script which relies on this command to delete our files faster:

#!/usr/bin/env ruby
require 'csv'
require 'json'
require 'fileutils'
require 'shellwords'

# Empties a versioned S3 bucket: lists every object version and delete marker,
# then removes them in batches through `aws s3api delete-objects`.
# Limitation: keys containing whitespace or commas are not parsed correctly.
class S3BucketCleaner
  BATCH_SIZE = 1000 # maximum number of objects per delete-objects request

  attr_reader :bucket_name, :temp_files_list, :output_file

  def self.clean(bucket_name)
    new(bucket_name).clean
  end

  def initialize(bucket_name)
    @bucket_name = bucket_name
    @temp_files_list = 'files_list.csv'
    @output_file = 'aws_commands.sh'
  end

  def clean
    handle_undeleted_files
    handle_deleted_files
  end

  # Pass 1: object versions.
  def handle_undeleted_files
    handle_files { `#{undeleted_command}` }
  end

  # Pass 2: delete markers.
  def handle_deleted_files
    handle_files { `#{deleted_command}` }
  end

  # Writes "key,version_id" lines for every object version.
  def undeleted_command
    "aws --output text s3api list-object-versions --bucket #{bucket_name} | grep -E \"^VERSIONS\" | awk '{print $4\",\"$8}' > #{temp_files_list}"
  end

  # Writes "key,version_id" lines for every delete marker.
  def deleted_command
    "aws --output text s3api list-object-versions --bucket #{bucket_name} | grep -E \"^DELETEMARKERS\" | awk '{print $3\",\"$5}' > #{temp_files_list}"
  end

  # Lists the objects (via the block), generates the batch script, runs it, cleans up.
  def handle_files
    File.open(temp_files_list, 'w') {}
    yield
    create_batch_delete
    `/bin/bash #{output_file}`
    remove_files
  end

  def remove_files
    FileUtils.rm(temp_files_list)
    FileUtils.rm(output_file)
  end

  # Groups the listed objects into batches and returns the number of lines read.
  def create_batch_delete
    File.open(output_file, 'w') { |f| f.puts '#!/bin/bash' }
    lines = []
    nb_lines_processed = 0
    IO.foreach(temp_files_list) do |line|
      nb_lines_processed += 1
      lines << line
      if lines.size >= BATCH_SIZE
        add_to_batch_delete(lines)
        lines = []
      end
    end
    add_to_batch_delete(lines) # flush the last, partial batch
    nb_lines_processed
  end

  # Appends one delete-objects command covering the given CSV lines.
  def add_to_batch_delete(lines)
    rows = ::CSV.parse(lines.join, headers: false) rescue nil
    return unless rows && rows.any?

    objects = rows.select { |row| row[0] && row[1] }
                  .map { |row| { 'Key' => row[0], 'VersionId' => row[1] } }
    return if objects.empty?

    structure = { 'Objects' => objects, 'Quiet' => true }
    # Shell-escape the JSON so keys containing quotes cannot break the command.
    command = "aws s3api delete-objects --bucket #{bucket_name} --delete #{Shellwords.escape(structure.to_json)}"
    File.open(output_file, 'a') { |f| f.puts command }
  end
end

bucket_name = ARGV[0]
abort "Usage: #{$PROGRAM_NAME} BUCKET_NAME" if bucket_name.to_s.empty?
S3BucketCleaner.clean(bucket_name)

Pre-requisites:

To use this script you need to:

  • Export your Amazon credentials: export AWS_ACCESS_KEY_ID=... and export AWS_SECRET_ACCESS_KEY=...
  • Have the AWS CLI installed.
  • Have a Ruby interpreter installed.
  • Download the above file and make it executable: chmod +x FILE
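
The checks in this list can be automated before launching a long run. The following Ruby sketch is one way to do it, assuming credentials are supplied through the two environment variables named above rather than a credentials file. The helper missing_prerequisites is hypothetical and not part of the script.

```ruby
# Returns a list of human-readable problems; an empty list means ready to go.
# `missing_prerequisites` is a hypothetical helper, not part of the script above.
def missing_prerequisites(env = ENV)
  problems = []
  %w[AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY].each do |name|
    problems << "#{name} is not exported" if env[name].to_s.empty?
  end
  # Look for an executable named `aws` in each PATH directory.
  path_dirs = env['PATH'].to_s.split(File::PATH_SEPARATOR)
  aws_found = path_dirs.any? { |dir| File.executable?(File.join(dir, 'aws')) }
  problems << 'aws CLI not found in PATH' unless aws_found
  problems
end

missing_prerequisites.each { |problem| warn problem }
```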

Usage:

Simply execute the script like any other program, passing the name of the bucket you would like to empty as the argument.
For example, assuming the Ruby script is called S3_bucket_cleaner.rb:

./S3_bucket_cleaner.rb BUCKET_NAME

Figures and conclusion:

The above script removed all the files from our S3 bucket in less than 20 minutes, which was good! It would be great if Amazon let people empty or remove an S3 bucket regardless of how full it is. In the meantime, we are happy to share this script with you in case you run into a similar scenario.