Other articles for database geeks
- Using Athena from the AWS CLI on Linux
- Getting today in Athena
- Getting today in PostgreSQL
- How to get the metadata of AWS Athena tables
- Get the number of columns in an AWS Athena table
One of the amazing things about AWS is that the CLI lets you use most AWS services directly from the shell, in my case bash. So here is my cheat-sheet to remind me how to use Athena from bash.
Configure your AWS CLI
Have the AWS CLI configured as documented here, either with the standard access and secret keys, with SSO, or with roles.
I always define the following 3 variables to run the commands in the right account and right region.
$ export AWS_ACCOUNT='1234567890'
$ export AWS_REGION='eu-west-1'
$ export ATHENA_S3_PATH='s3://<myS3Bucket>/athena/' # for query results
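Before running anything, it can be worth checking that the CLI really is pointing at the account you expect. A minimal sketch; check_account is my own helper name, not part of the AWS CLI:

```shell
# Hypothetical guard: succeeds only when the identity the CLI is using
# matches the account stored in AWS_ACCOUNT.
check_account() {
    local actual
    actual=$(aws sts get-caller-identity --query Account --output text) || return 1
    [ "$actual" = "$AWS_ACCOUNT" ]
}
```

I would call it as `check_account || exit 1` at the top of a script.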
Test the connection
To check that the configuration works.
$ aws athena list-data-catalogs --region $AWS_REGION
{
"DataCatalogsSummary": [
{
"CatalogName": "AwsDataCatalog",
"Type": "GLUE"
}
]
}
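If you only want the catalog names out of that JSON, jq can extract them (assuming jq is installed):

```shell
# Print one catalog name per line instead of the full JSON document.
aws athena list-data-catalogs --region $AWS_REGION \
    | jq -r '.DataCatalogsSummary[].CatalogName'
```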
Run a query that does not return any value (DDL)
From the command line
Put the query in the variable sqlquery
I use read with a heredoc to build a multi-line variable while writing.
$ read -d '' sqlquery << EOF
> CREATE EXTERNAL TABLE test.frank (
> col1 string,
> col2 int,
> col3 string)
> LOCATION 's3://<s3PathToaCSVFileWith3Columns>'
> EOF
$ echo $sqlquery
CREATE EXTERNAL TABLE test.frank ( col1 string, col2 int, col3 string) LOCATION 's3://<s3PathToaCSVFileWith3Columns>'
Run the query
$ aws athena start-query-execution --region $AWS_REGION --query-string "$sqlquery" --result-configuration "OutputLocation=$ATHENA_S3_PATH"
{
"QueryExecutionId": "728f29d4-9e25-4730-a4f8-cf9a9195a273"
}
$ aws athena get-query-execution --query-execution-id "728f29d4-9e25-4730-a4f8-cf9a9195a273" --region=$AWS_REGION
{
"QueryExecution": {
"QueryExecutionId": "728f29d4-9e25-4730-a4f8-cf9a9195a273",
"Query": "CREATE EXTERNAL TABLE test.frank (\ncol1 string,\ncol2 int,\ncol3 string) \nLOCATION 's3://<s3PathToaCSVFileWith3Columns>'",
"StatementType": "DDL",
"ResultConfiguration": {
"OutputLocation": "s3://<myS3Bucket>/athena/728f29d4-9e25-4730-a4f8-cf9a9195a273.txt"
},
"QueryExecutionContext": {},
"Status": {
"State": "SUCCEEDED",
"SubmissionDateTime": "2021-02-11T11:16:14.152000+00:00",
"CompletionDateTime": "2021-02-11T11:16:14.588000+00:00"
},
"Statistics": {
"EngineExecutionTimeInMillis": 354,
"DataScannedInBytes": 0,
"TotalExecutionTimeInMillis": 436,
"QueryQueueTimeInMillis": 57,
"ServiceProcessingTimeInMillis": 25
},
"WorkGroup": "primary"
}
}
The state here is SUCCEEDED, but it can also be QUEUED, RUNNING, FAILED or CANCELLED.
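Instead of eyeballing the full JSON, the state alone can be extracted with jq (assuming jq is installed):

```shell
# Keep only the query state from the get-query-execution output.
aws athena get-query-execution --query-execution-id "728f29d4-9e25-4730-a4f8-cf9a9195a273" \
    --region $AWS_REGION | jq -r .QueryExecution.Status.State
```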
From a script
info, debug, error and warning are functions I use in my scripts for logging; consider them equivalents of echo.
$ athenatablequery="CREATE EXTERNAL TABLE test.frank ( col1 string, col2 int, col3 string) LOCATION 's3://<s3PathToaCSVFileWith3Columns>'"
info 'Trying to create the athena table test.frank'
queryexecutionid=$(aws athena start-query-execution --region $AWS_REGION --query-string "$athenatablequery" --result-configuration "OutputLocation=$ATHENA_S3_PATH" | jq -r .QueryExecutionId)
debug 'got the queryexecution id [' $queryexecutionid ']'
querystatusstate='QUEUED'
getqueryexecutionstring="aws athena get-query-execution --query-execution-id $queryexecutionid --region=$AWS_REGION"
while [ "$querystatusstate" = "RUNNING" -o "$querystatusstate" = "QUEUED" ]
do
debug 'command line is [' $getqueryexecutionstring ']'
querystatusjson=$($getqueryexecutionstring)
debug 'querystatusjson is [' $querystatusjson ']'
querystatusstate=$(echo $querystatusjson | jq -r .QueryExecution.Status.State)
querystatuschangereason=$(echo $querystatusjson | jq -r .QueryExecution.Status.StateChangeReason)
debug 'querystatusstate value is [' $querystatusstate ']'
debug 'querystatuschangereason value is [' $querystatuschangereason ']'
done
if [ "$querystatusstate" = 'SUCCEEDED' ] ;then
info 'Table created!'
else
if [[ $querystatuschangereason =~ .*AlreadyExistsException.* ]]; then # using a BASH only feature with regexp
info '-> The table already exists' && exit 4
else
error 'unexpected result [' $querystatusstate '] fail' && exit 3
fi
fi
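The submit-then-poll pattern above repeats in every script, so I find it handy to wrap it in a function. A sketch; run_athena_query is my own hypothetical name, and the one-second sleep between polls is arbitrary:

```shell
# Hypothetical helper: submit a query and block until Athena reports a
# final state. Prints the final state; callers compare it to SUCCEEDED.
run_athena_query() {
    local query="$1" id state
    id=$(aws athena start-query-execution --region "$AWS_REGION" \
        --query-string "$query" \
        --result-configuration "OutputLocation=$ATHENA_S3_PATH" | jq -r .QueryExecutionId)
    state='QUEUED'
    while [ "$state" = "QUEUED" ] || [ "$state" = "RUNNING" ]; do
        sleep 1
        state=$(aws athena get-query-execution --query-execution-id "$id" \
            --region "$AWS_REGION" | jq -r .QueryExecution.Status.State)
    done
    echo "$state"
}
```

Then a script only has to test `[ "$(run_athena_query "$sqlquery")" = 'SUCCEEDED' ]`.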
Run a query that returns something (DML)
From the command line
Put the query in the variable sqlquery
I use read with a heredoc to build a multi-line variable while writing.
$ read -d '' sqlquery << EOF
> SELECT col1, col2, col3
> FROM test.frank
> EOF
$ echo $sqlquery
SELECT col1, col2, col3 FROM test.frank
Run the query
$ aws athena start-query-execution --region $AWS_REGION --query-string "$sqlquery" --result-configuration "OutputLocation=$ATHENA_S3_PATH"
{
"QueryExecutionId": "3d2939f8-3378-4f64-ae0f-01be2dbc842c"
}
$ aws athena get-query-execution --query-execution-id "3d2939f8-3378-4f64-ae0f-01be2dbc842c" --region=$AWS_REGION
{
"QueryExecution": {
"QueryExecutionId": "3d2939f8-3378-4f64-ae0f-01be2dbc842c",
"Query": "SELECT col1, col2, col3\nFROM test.frank",
"StatementType": "DML",
"ResultConfiguration": {
"OutputLocation": "s3://<myS3Bucket>/athena/3d2939f8-3378-4f64-ae0f-01be2dbc842c.csv"
},
"QueryExecutionContext": {},
"Status": {
"State": "SUCCEEDED",
"SubmissionDateTime": "2021-02-11T11:31:00.966000+00:00",
"CompletionDateTime": "2021-02-11T11:31:03.875000+00:00"
},
"Statistics": {
"EngineExecutionTimeInMillis": 2559,
"DataScannedInBytes": 1468006,
"TotalExecutionTimeInMillis": 2909,
"QueryQueueTimeInMillis": 235,
"QueryPlanningTimeInMillis": 488,
"ServiceProcessingTimeInMillis": 115
},
"WorkGroup": "primary"
}
}
The state is SUCCEEDED, and the results are in s3://<myS3Bucket>/athena/3d2939f8-3378-4f64-ae0f-01be2dbc842c.csv
Getting the data the JSON way
$ aws athena get-query-results --query-execution-id "3d2939f8-3378-4f64-ae0f-01be2dbc842c" --region=$AWS_REGION
Then you can play with jq to extract the data you need, but it is hard. Or you can check the line with OutputLocation and copy the CSV results locally with aws s3:
$ aws s3 cp s3://<myS3Bucket>/athena/3d2939f8-3378-4f64-ae0f-01be2dbc842c.csv .
The file 3d2939f8-3378-4f64-ae0f-01be2dbc842c.csv now in your folder contains the data you seek.
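In my experience the file is a standard CSV with a header row and double-quoted fields, so the usual shell tools apply. For example, to peek at the data without the header (the filename is the one downloaded above):

```shell
# Skip the header row (line 1) and show the first few data rows.
localfile='3d2939f8-3378-4f64-ae0f-01be2dbc842c.csv'
tail -n +2 "$localfile" | head -5
```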
From a script
querystring="SELECT col1, col2, col3 FROM test.frank;"
queryexecutionid=$(aws athena start-query-execution --region $AWS_REGION --query-string "$querystring" --result-configuration "OutputLocation=$ATHENA_S3_PATH" | jq -r .QueryExecutionId)
debug 'got the queryexecution id [' $queryexecutionid ']'
querystatusstate='QUEUED'
getqueryexecutionstring="aws athena get-query-execution --query-execution-id $queryexecutionid --region=$AWS_REGION"
while [ "$querystatusstate" = "RUNNING" -o "$querystatusstate" = "QUEUED" ]
do
#debug 'command line is [' $getqueryexecutionstring ']'
querystatusjson=$($getqueryexecutionstring)
#debug 'querystatusjson is [' $querystatusjson ']'
querystatusstate=$(echo $querystatusjson | jq -r .QueryExecution.Status.State)
debug 'querystatusstate value is [' $querystatusstate ']'
done
[ "$querystatusstate" != 'SUCCEEDED' ] && error 'unexpected result [' $querystatusstate '] fail' && exit 3
resultlocation=$(aws athena get-query-execution --query-execution-id "$queryexecutionid" --region=$AWS_REGION | jq -r '.QueryExecution.ResultConfiguration.OutputLocation')
[ $? -ne 0 ] && error "Dying" && exit 111
aws s3 cp "$resultlocation" .
localfilename="${queryexecutionid}.csv"
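From there the script can work on the local copy. A possible continuation (a sketch; it assumes the localfilename variable set above and the info logging function from earlier):

```shell
# Count the data rows in the downloaded result file (total lines minus the header).
rowcount=$(( $(wc -l < "$localfilename") - 1 ))
info "the query returned ${rowcount} rows"
```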