AWS experiment 1 - best way to create many small files

Introduction

As a way to continue learning AWS, and in preparation for future certifications, I want to run some experiments on AWS. Each experiment will be easily replicable so you can try it yourself.

This first experiment is foundational: it is focused on finding a fast way to create a controlled number of files of a known (small) size.

Conclusion if you don’t want to read it all:

Creating a big file with dd and then splitting it is by far more efficient than running dd in a for-loop, at least for files under 100k.

For example, to create 128 files of 10k, use the following commands:

dd if=/dev/urandom of=random.img count=128 bs=10k
split -n 128 random.img
rm random.img
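
To sanity-check the result (a quick verification, not part of the original commands), count the pieces and check their size:

ls x* | wc -l     # should print 128 (split names the pieces xaa, xab, ...)
ls -l x* | head   # each piece should be 10240 bytes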

Conditions

  • Use of AWS
  • t2.micro with an 8 GB gp2 disk, all defaults
  • AMI amzn2-ami-hvm-2.0.20200207.1-x86_64-gp2 (ami-0389b2a3c4948b1a0)
  • Linux ip-172-31-6-151.eu-west-2.compute.internal 4.14.165-131.185.amzn2.x86_64 #1 SMP Wed Jan 15 14:19:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Details

POMO (Purpose, Objective, Method, Outcome)

Purpose:

Be able to generate many files with random content quickly.

Objective

Find a good way to generate lots of small files for future tests, by comparing running dd multiple times against running dd once followed by a single split.

Method

Testing 2 approaches

  • 1.1 - using a for-loop with dd
  • 1.2 - using a single dd to generate a big file, then using split to create many small files

Outcome

A script using the winning method.

Assumptions:

  • use of dd reading from /dev/urandom, so that real data is written rather than disk space simply being reserved by a syscall
  • use of sync after each experiment to make sure data is on disk (and not just in the page cache)
  • use of time (the shell command) to collect the timing data
  • the difference between real time and user+sys is assumed to be the time spent writing to disk, driven by disk speed and filesystem configuration (illustrated just below)
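
To make the last assumption concrete, here is a minimal sketch (assuming bash; not part of the original experiment) that prints real, user and sys for a single dd write, so the gap between real and user+sys can be read off directly:

# sketch: time a small random write and a sync; the file name is illustrative
TIMEFORMAT='real=%R user=%U sys=%S'   # output format of bash's time keyword
time dd if=/dev/urandom of=timing-test.img count=100 bs=10k 2>/dev/null
time sync
rm timing-test.img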

Possible other experiments

  • test with bigger files (see the sketch after this list)
  • test with more files
  • on disks with different speeds
  • on disks of different sizes
  • on different storage technologies (e.g. EFS)
  • on different instance types
  • changing the kernel disk-cache settings
  • changing the filesystem parameters
  • changing the filesystem
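
A simple way to run the first two variations (bigger files, more files) would be to wrap the experimental script below in a loop over parameter pairs. A sketch, assuming the script is saved as exp1-many-files.sh and with illustrative size/count pairs:

# each entry is "size count"; word splitting in the loop body is intentional
for params in "100k 10000" "10k 100000"; do
    ./exp1-many-files.sh $params
done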

Experimental Script

This simple bash script tests both approaches: multiple dd runs in a for-loop, and a single dd followed by split.

#!/bin/bash

EXP1NAME=exp1-many-files

[ -z "$2" -o -z "$1" ] && echo "FAILURE need 2 parameters: param1 is size of each file (e.g. 10k), param2 is number of final files" && exit
EXP1SIZE=$1
EXP1COUNT=$2
echo "parameters are all good"
echo "EXP1SIZE = $EXP1SIZE"
echo "EXP1COUNT = $EXP1COUNT"


# back to home dir
cd $HOME
rm -rf ${EXP1NAME}.1 2> /dev/null
mkdir ${EXP1NAME}.1
rm -rf ${EXP1NAME}.2 2> /dev/null
mkdir ${EXP1NAME}.2

echo '*************** START OF EXPERIMENT 1.1 ***************'
date
# experiment 1.1 using dd to generate many small files with a bash for-loop
cd $HOME
cd ${EXP1NAME}.1
time for ((i=1; i<=$EXP1COUNT; i++)); do dd if=/dev/urandom of=random-${i}.img count=1 bs=${EXP1SIZE} >/dev/null 2>&1; done
echo "sync"
time sync
date
echo 'END OF EXPERIMENT 1.1'

echo
echo
echo '*************** START OF EXPERIMENT 1.2 ***************'
date
# experiment 1.2 using dd to generate 1 big file, then split to cut it into pieces
# we use random data to make sure data is actually written
cd $HOME
cd ${EXP1NAME}.2
echo "Generating one file by adding ${EXP1COUNT}  blocks of ${EXP1SIZE} with random content from /dev/urandom"
time dd if=/dev/urandom of=random.img count=${EXP1COUNT} bs=${EXP1SIZE} > /dev/null 2>&1
sync

echo "splitting the file into ${EXP1COUNT} pieces"
time split -n ${EXP1COUNT} random.img
echo "sync"
time sync
date
echo 'END OF EXPERIMENT 1.2'
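
After a run, the generated directories can be removed with an optional cleanup step (not part of the script above):

cd $HOME
rm -rf exp1-many-files.1 exp1-many-files.2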

RUNS

Experiments

  1. create 10 files of 10k - checking that it all works
  2. create 10,000 files of 10k (bs=10k) - baseline
  3. create 100,000 files of 1k (bs=1k) - smaller files, so the disk should write each one faster?
  4. create 1,000 files of 100k (bs=100k) - fewer, larger files; the filesystem will play a bigger role

Bash commands for the runs

  1. ./exp1-many-files.sh 10k 10
  2. ./exp1-many-files.sh 10k 10000
  3. ./exp1-many-files.sh 1k 100000
  4. ./exp1-many-files.sh 100k 1000

Execution and raw results

RUN 1
[ec2-user@ip-172-31-6-151 ~]$ ./exp1-many-files.sh 10k 10
parameters are all good
EXP1SIZE = 10k
EXP1COUNT = 10
*************** START OF EXPERIMENT 1.1 ***************
Fri Feb 28 11:31:27 UTC 2020

real    0m0.009s
user    0m0.007s
sys     0m0.001s
sync

real    0m0.011s
user    0m0.000s
sys     0m0.000s
Fri Feb 28 11:31:27 UTC 2020
END OF EXPERIMENT 1.1


*************** START OF EXPERIMENT 1.2 ***************
Fri Feb 28 11:31:27 UTC 2020
Generating one file by adding 10  blocks of 10k with random content from /dev/urandom

real    0m0.001s
user    0m0.001s
sys     0m0.000s
splitting the file into 10 pieces

real    0m0.001s
user    0m0.001s
sys     0m0.000s
sync

real    0m0.003s
user    0m0.001s
sys     0m0.000s
Fri Feb 28 11:31:27 UTC 2020
END OF EXPERIMENT 1.2


RUN 2
[ec2-user@ip-172-31-6-151 ~]$ ./exp1-many-files.sh 10k 10000
parameters are all good
EXP1SIZE = 10k
EXP1COUNT = 10000
*************** START OF EXPERIMENT 1.1 ***************
Fri Feb 28 11:32:15 UTC 2020

real    0m8.202s
user    0m6.547s
sys     0m1.504s

sync
real    0m0.138s
user    0m0.001s
sys     0m0.000s
Fri Feb 28 11:32:23 UTC 2020
END OF EXPERIMENT 1.1


*************** START OF EXPERIMENT 1.2 ***************
Fri Feb 28 11:32:23 UTC 2020
Generating one file by adding 10000  blocks of 10k with random content from /dev/urandom

real    0m0.565s
user    0m0.000s
sys     0m0.560s
splitting the file into 10000 pieces

real    0m0.336s
user    0m0.024s
sys     0m0.276s
sync

real    0m1.782s
user    0m0.003s
sys     0m0.000s
Fri Feb 28 11:32:27 UTC 2020
END OF EXPERIMENT 1.2

RUN 3
[ec2-user@ip-172-31-6-151 ~]$ ./exp1-many-files.sh 1k 100000
parameters are all good
EXP1SIZE = 1k
EXP1COUNT = 100000
*************** START OF EXPERIMENT 1.1 ***************
Fri Feb 28 11:32:53 UTC 2020

real    1m18.358s
user    1m2.505s
sys     0m13.952s
sync

real    0m0.088s
user    0m0.001s
sys     0m0.000s
Fri Feb 28 11:34:11 UTC 2020
END OF EXPERIMENT 1.1


*************** START OF EXPERIMENT 1.2 ***************
Fri Feb 28 11:34:11 UTC 2020
Generating one file by adding 100000  blocks of 1k with random content from /dev/urandom

real    0m0.776s
user    0m0.076s
sys     0m0.670s
splitting the file into 100000 pieces

real    0m8.028s
user    0m0.232s
sys     0m3.323s
sync

real    0m0.940s
user    0m0.003s
sys     0m0.000s
Fri Feb 28 11:34:22 UTC 2020
END OF EXPERIMENT 1.2

RUN 4
[ec2-user@ip-172-31-6-151 ~]$ ./exp1-many-files.sh 100k 1000
parameters are all good
EXP1SIZE = 100k
EXP1COUNT = 1000
*************** START OF EXPERIMENT 1.1 ***************
Fri Feb 28 11:35:13 UTC 2020

real    0m1.342s
user    0m0.948s
sys     0m0.362s
sync

real    0m0.170s
user    0m0.001s
sys     0m0.000s
Fri Feb 28 11:35:15 UTC 2020
END OF EXPERIMENT 1.1


*************** START OF EXPERIMENT 1.2 ***************
Fri Feb 28 11:35:15 UTC 2020
Generating one file by adding 1000  blocks of 100k with random content from /dev/urandom

real    0m0.555s
user    0m0.004s
sys     0m0.546s
splitting the file into 1000 pieces

real    0m0.084s
user    0m0.004s
sys     0m0.076s
sync

real    0m1.520s
user    0m0.002s
sys     0m0.000s
Fri Feb 28 11:35:18 UTC 2020
END OF EXPERIMENT 1.2

In table form (each cell is the sum of the real times reported for that approach, including the timed sync):

Run   for-loop   dd + split
1     0.02s      0.005s
2     8.34s      2.683s
3     78.446s    9.744s
4     1.512s     2.159s

Conclusion

For files of less than 100k it is much better to create a single big file and then cut it up with split. This is probably due to the overhead the for-loop pays for every file: each iteration forks and execs a new dd process, with the associated context switches in the kernel.
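
A quick way to see this per-process overhead in isolation (a sketch, not part of the original runs) is to time the same loop spawning an external no-op binary versus running a shell builtin:

time for ((i=1; i<=10000; i++)); do /bin/true; done   # one fork+exec per iteration
time for ((i=1; i<=10000; i++)); do :; done           # ":" is a builtin, no new process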

Final script
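
The outcome called for a script using the winning method. A minimal sketch of what such a script could look like, based on the dd + split commands from the conclusion (the script name and parameter handling are illustrative):

#!/bin/bash
# create-many-files.sh (illustrative name) - create COUNT files of SIZE each with random content
SIZE=${1:?usage: SIZE COUNT (e.g. 10k 128)}
COUNT=${2:?usage: SIZE COUNT (e.g. 10k 128)}
dd if=/dev/urandom of=random.img count=${COUNT} bs=${SIZE} 2>/dev/null
split -n ${COUNT} random.img
rm random.img
sync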

Result data in Excel

Excel with results
