AWS experiment 1 - best way to create many small files

Introduction

As a way to continue learning AWS, and in preparation for future certifications, I want to run some experiments on AWS. Each experiment will be easily replicable so you can try it yourself.

This first experiment is foundational: it is focused on finding a fast way to create a controlled number of files of a known (small) size.

Conclusion if you don’t want to read it all:

Creating a big file with dd and then splitting it is by far more efficient than running dd in a for-loop, at least for files under 100k.

For example, to create 128 files of 10k, use the following commands:

dd if=/dev/urandom of=random.img count=128 bs=10k
split -n 128 random.img
rm random.img
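
To sanity-check the result (a quick verification, not part of the original commands), count the pieces and check their size:

ls x* | wc -l     # should print 128 (split names the pieces xaa, xab, ...)
ls -l x* | head   # each piece should be 10240 bytes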

Conditions

  • Use of AWS
  • t2.micro with an 8 GB gp2 disk, all defaults
  • AMI amzn2-ami-hvm-2.0.20200207.1-x86_64-gp2 (ami-0389b2a3c4948b1a0)
  • Linux ip-172-31-6-151.eu-west-2.compute.internal 4.14.165-131.185.amzn2.x86_64 #1 SMP Wed Jan 15 14:19:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Details

POMO (Purpose, Objective, Method, Outcome)

Purpose:

Be able to generate many files with random content quickly.

Objective

Find a good way to generate lots of small files for future tests, by comparing running dd multiple times against running dd once followed by a single split.

Method

Testing 2 approaches

  • 1.1 - using a for-loop with dd
  • 1.2 - using a single dd to generate a big file, then using split to create many small files

Outcome

A script using the winning method.

Assumptions:

  • use of dd reading from /dev/urandom, so that real data is written rather than disk space simply being reserved by a syscall
  • use of sync after each experiment to make sure data is on disk (and not just in the page cache)
  • use of time (the shell command) to collect the timing data
  • the difference between real time and user+sys is assumed to be the time spent writing to disk, driven by disk speed and filesystem configuration (illustrated just below)
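
To make the last assumption concrete, here is a minimal sketch (assuming bash; not part of the original experiment) that prints real, user and sys for a single dd write, so the gap between real and user+sys can be read off directly:

# sketch: time a small random write and a sync; the file name is illustrative
TIMEFORMAT='real=%R user=%U sys=%S'   # output format of bash's time keyword
time dd if=/dev/urandom of=timing-test.img count=100 bs=10k 2>/dev/null
time sync
rm timing-test.img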

Possible other experiments

  • test with bigger files (see the sketch after this list)
  • test with more files
  • on disks with different speeds
  • on disks of different sizes
  • on different storage technologies (e.g. EFS)
  • on different instance types
  • changing the kernel disk-cache settings
  • changing the filesystem parameters
  • changing the filesystem
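
A simple way to run the first two variations (bigger files, more files) would be to wrap the experimental script below in a loop over parameter pairs. A sketch, assuming the script is saved as exp1-many-files.sh and with illustrative size/count pairs:

# each entry is "size count"; word splitting in the loop body is intentional
for params in "100k 10000" "10k 100000"; do
    ./exp1-many-files.sh $params
done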

Experimental Script

This simple bash script tests both approaches: multiple dd runs in a for-loop, and a single dd followed by split.

#!/bin/bash

EXP1NAME=exp1-many-files

[ -z "$2" -o -z "$1" ] && echo "FAILURE need 2 parameters: param1 is size of each file (e.g. 10k), param2 is number of final files" && exit
EXP1SIZE=$1
EXP1COUNT=$2
echo "parameters are all good"
echo "EXP1SIZE = $EXP1SIZE"
echo "EXP1COUNT = $EXP1COUNT"


# back to home dir
cd $HOME
rm -rf ${EXP1NAME}.1 2> /dev/null
mkdir ${EXP1NAME}.1
rm -rf ${EXP1NAME}.2 2> /dev/null
mkdir ${EXP1NAME}.2

echo '*************** START OF EXPERIMENT 1.1 ***************'
date
# experiment 1.1 using dd to generate many small files with a bash for-loop
cd $HOME
cd ${EXP1NAME}.1
time for ((i=1; i<=$EXP1COUNT; i++)); do dd if=/dev/urandom of=random-${i}.img count=1 bs=${EXP1SIZE} >/dev/null 2>&1; done
echo "sync"
time sync
date
echo 'END OF EXPERIMENT 1.1'

echo
echo
echo '*************** START OF EXPERIMENT 1.2 ***************'
date
# experiment 1.2 using dd to generate 1 big file, then split to cut it into pieces
# we use random data to make sure data is actually written
cd $HOME
cd ${EXP1NAME}.2
echo "Generating one file by adding ${EXP1COUNT}  blocks of ${EXP1SIZE} with random content from /dev/urandom"
time dd if=/dev/urandom of=random.img count=${EXP1COUNT} bs=${EXP1SIZE} > /dev/null 2>&1
sync

echo "splitting the file into ${EXP1COUNT} pieces"
time split -n ${EXP1COUNT} random.img
echo "sync"
time sync
date
echo 'END OF EXPERIMENT 1.2'
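
After a run, the generated directories can be removed with an optional cleanup step (not part of the script above):

cd $HOME
rm -rf exp1-many-files.1 exp1-many-files.2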

RUNS

Experiments

  1. create 10 files of 10k - checking that it all works
  2. create 10,000 files of 10k (bs=10k) - baseline
  3. create 100,000 files of 1k (bs=1k) - smaller files, so the disk should write each one faster?
  4. create 1,000 files of 100k (bs=100k) - fewer, larger files; the filesystem will play a bigger role

Bash commands for the runs

  1. ./exp1-many-files.sh 10k 10
  2. ./exp1-many-files.sh 10k 10000
  3. ./exp1-many-files.sh 1k 100000
  4. ./exp1-many-files.sh 100k 1000

Execution and raw results

RUN 1
[ec2-user@ip-172-31-6-151 ~]$ ./exp1-many-files.sh 10k 10
parameters are all good
EXP1SIZE = 10k
EXP1COUNT = 10
*************** START OF EXPERIMENT 1.1 ***************
Fri Feb 28 11:31:27 UTC 2020

real    0m0.009s
user    0m0.007s
sys     0m0.001s
sync

real    0m0.011s
user    0m0.000s
sys     0m0.000s
Fri Feb 28 11:31:27 UTC 2020
END OF EXPERIMENT 1.1


*************** START OF EXPERIMENT 1.2 ***************
Fri Feb 28 11:31:27 UTC 2020
Generating one file by adding 10  blocks of 10k with random content from /dev/urandom

real    0m0.001s
user    0m0.001s
sys     0m0.000s
splitting the file into 10 pieces

real    0m0.001s
user    0m0.001s
sys     0m0.000s
sync

real    0m0.003s
user    0m0.001s
sys     0m0.000s
Fri Feb 28 11:31:27 UTC 2020
END OF EXPERIMENT 1.2


RUN 2
[ec2-user@ip-172-31-6-151 ~]$ ./exp1-many-files.sh 10k 10000
parameters are all good
EXP1SIZE = 10k
EXP1COUNT = 10000
*************** START OF EXPERIMENT 1.1 ***************
Fri Feb 28 11:32:15 UTC 2020

real    0m8.202s
user    0m6.547s
sys     0m1.504s

sync
real    0m0.138s
user    0m0.001s
sys     0m0.000s
Fri Feb 28 11:32:23 UTC 2020
END OF EXPERIMENT 1.1


*************** START OF EXPERIMENT 1.2 ***************
Fri Feb 28 11:32:23 UTC 2020
Generating one file by adding 10000  blocks of 10k with random content from /dev/urandom

real    0m0.565s
user    0m0.000s
sys     0m0.560s
splitting the file into 10000 pieces

real    0m0.336s
user    0m0.024s
sys     0m0.276s
sync

real    0m1.782s
user    0m0.003s
sys     0m0.000s
Fri Feb 28 11:32:27 UTC 2020
END OF EXPERIMENT 1.2

RUN 3
[ec2-user@ip-172-31-6-151 ~]$ ./exp1-many-files.sh 1k 100000
parameters are all good
EXP1SIZE = 1k
EXP1COUNT = 100000
*************** START OF EXPERIMENT 1.1 ***************
Fri Feb 28 11:32:53 UTC 2020

real    1m18.358s
user    1m2.505s
sys     0m13.952s
sync

real    0m0.088s
user    0m0.001s
sys     0m0.000s
Fri Feb 28 11:34:11 UTC 2020
END OF EXPERIMENT 1.1


*************** START OF EXPERIMENT 1.2 ***************
Fri Feb 28 11:34:11 UTC 2020
Generating one file by adding 100000  blocks of 1k with random content from /dev/urandom

real    0m0.776s
user    0m0.076s
sys     0m0.670s
splitting the file into 100000 pieces

real    0m8.028s
user    0m0.232s
sys     0m3.323s
sync

real    0m0.940s
user    0m0.003s
sys     0m0.000s
Fri Feb 28 11:34:22 UTC 2020
END OF EXPERIMENT 1.2

RUN 4
[ec2-user@ip-172-31-6-151 ~]$ ./exp1-many-files.sh 100k 1000
parameters are all good
EXP1SIZE = 100k
EXP1COUNT = 1000
*************** START OF EXPERIMENT 1.1 ***************
Fri Feb 28 11:35:13 UTC 2020

real    0m1.342s
user    0m0.948s
sys     0m0.362s
sync

real    0m0.170s
user    0m0.001s
sys     0m0.000s
Fri Feb 28 11:35:15 UTC 2020
END OF EXPERIMENT 1.1


*************** START OF EXPERIMENT 1.2 ***************
Fri Feb 28 11:35:15 UTC 2020
Generating one file by adding 1000  blocks of 100k with random content from /dev/urandom

real    0m0.555s
user    0m0.004s
sys     0m0.546s
splitting the file into 1000 pieces

real    0m0.084s
user    0m0.004s
sys     0m0.076s
sync

real    0m1.520s
user    0m0.002s
sys     0m0.000s
Fri Feb 28 11:35:18 UTC 2020
END OF EXPERIMENT 1.2

In table form (each cell is the sum of the real times reported for that approach, including the timed sync):

Run   for-loop   dd + split
1     0.02s      0.005s
2     8.34s      2.683s
3     78.446s    9.744s
4     1.512s     2.159s

Conclusion

For files of less than 100k it is much better to create a single big file and then cut it up with split. This is probably due to the overhead the for-loop pays for every file: each iteration forks and execs a new dd process, with the associated context switches in the kernel.
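
A quick way to see this per-process overhead in isolation (a sketch, not part of the original runs) is to time the same loop spawning an external no-op binary versus running a shell builtin:

time for ((i=1; i<=10000; i++)); do /bin/true; done   # one fork+exec per iteration
time for ((i=1; i<=10000; i++)); do :; done           # ":" is a builtin, no new process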

Final script
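
The outcome called for a script using the winning method. A minimal sketch of what such a script could look like, based on the dd + split commands from the conclusion (the script name and parameter handling are illustrative):

#!/bin/bash
# create-many-files.sh (illustrative name) - create COUNT files of SIZE each with random content
SIZE=${1:?usage: SIZE COUNT (e.g. 10k 128)}
COUNT=${2:?usage: SIZE COUNT (e.g. 10k 128)}
dd if=/dev/urandom of=random.img count=${COUNT} bs=${SIZE} 2>/dev/null
split -n ${COUNT} random.img
rm random.img
sync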

Result data in Excel

Excel with results
