Using AWS Athena from AWS CLI and bash.


One of the amazing things about AWS is that the CLI lets you use most AWS services straight from the shell, in my case bash. So here is my cheat sheet to remind me how to use Athena from bash.

Configure your AWS CLI

Have the AWS CLI configured as documented here, either with the standard access and secret keys, SSO, or roles.
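For example, if you go the access-key route, a minimal interactive setup looks like this (the values below are placeholders, not real keys):

$ aws configure
AWS Access Key ID [None]: <yourAccessKeyId>
AWS Secret Access Key [None]: <yourSecretAccessKey>
Default region name [None]: eu-west-1
Default output format [None]: json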

I always define the following 3 variables to run the commands in the right account and right region.

$ export AWS_ACCOUNT='1234567890'
$ export AWS_REGION='eu-west-1' 
$ export ATHENA_S3_PATH='s3://<myS3Bucket>/athena/' # for query results
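A quick way to double-check that the credentials actually point to the account in AWS_ACCOUNT is to ask STS who you are; the one-liner below is just my own convenience around the standard get-caller-identity call:

$ [ "$(aws sts get-caller-identity --query Account --output text --region $AWS_REGION)" = "$AWS_ACCOUNT" ] && echo 'right account' || echo 'WRONG account'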

Test the connection

To check that the configuration works:

$ aws athena list-data-catalogs --region $AWS_REGION
{
    "DataCatalogsSummary": [
        {
            "CatalogName": "AwsDataCatalog",
            "Type": "GLUE"
        }
    ]
}
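If you want to poke around a bit more, you can also list the databases in that catalog (list-databases takes the catalog name returned above and answers with a DatabaseList, one entry per database):

$ aws athena list-databases --catalog-name AwsDataCatalog --region $AWS_REGION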

Run a query that does not return any value (DDL)

From the command line

Put the query in the variable sqlquery

I use read with a heredoc so I can write the query as a multi-line variable.

$ read -d '' sqlquery << EOF
> CREATE EXTERNAL TABLE test.frank (
> col1 string,
> col2 int,
> col3 string)
> LOCATION 's3://<s3PathToaCSVFileWith3Columns>'
> EOF

$ echo $sqlquery
CREATE EXTERNAL TABLE test.frank ( col1 string, col2 int, col3 string) LOCATION 's3://<s3PathToaCSVFileWith3Columns>'
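Note that echo collapses the newlines only because $sqlquery is unquoted; quoting it preserves the original layout:

$ echo "$sqlquery"
CREATE EXTERNAL TABLE test.frank (
col1 string,
col2 int,
col3 string)
LOCATION 's3://<s3PathToaCSVFileWith3Columns>'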

Run the query

$ aws athena start-query-execution --region $AWS_REGION --query-string "$sqlquery" --result-configuration "OutputLocation=$ATHENA_S3_PATH"
{
    "QueryExecutionId": "728f29d4-9e25-4730-a4f8-cf9a9195a273"
}

$ aws athena get-query-execution --query-execution-id "728f29d4-9e25-4730-a4f8-cf9a9195a273" --region=$AWS_REGION
{
    "QueryExecution": {
        "QueryExecutionId": "728f29d4-9e25-4730-a4f8-cf9a9195a273",
        "Query": "CREATE EXTERNAL TABLE test.frank (\ncol1 string,\ncol2 int,\ncol3 string) \nLOCATION 's3://<s3PathToaCSVFileWith3Columns>'",
        "StatementType": "DDL",
        "ResultConfiguration": {
            "OutputLocation": "s3://<myS3Bucket>/athena/728f29d4-9e25-4730-a4f8-cf9a9195a273.txt"
        },
        "QueryExecutionContext": {},
        "Status": {
            "State": "SUCCEEDED",
            "SubmissionDateTime": "2021-02-11T11:16:14.152000+00:00",
            "CompletionDateTime": "2021-02-11T11:16:14.588000+00:00"
        },
        "Statistics": {
            "EngineExecutionTimeInMillis": 354,
            "DataScannedInBytes": 0,
            "TotalExecutionTimeInMillis": 436,
            "QueryQueueTimeInMillis": 57,
            "ServiceProcessingTimeInMillis": 25
        },
        "WorkGroup": "primary"
    }
}

The state here is SUCCEEDED, but it could also be QUEUED, RUNNING, FAILED or CANCELLED.
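If all you care about is the state, the CLI's --query option (JMESPath) can extract it directly, which is handy for scripting:

$ aws athena get-query-execution --query-execution-id "728f29d4-9e25-4730-a4f8-cf9a9195a273" --region $AWS_REGION --query 'QueryExecution.Status.State' --output text
SUCCEEDED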

From a script

info, debug, error and warning are functions I use in my scripts for logging; consider them equivalent to echo.

athenatablequery="CREATE EXTERNAL TABLE test.frank ( col1 string, col2 int, col3 string) LOCATION 's3://<s3PathToaCSVFileWith3Columns>'"
info 'Trying to create the athena table test.frank'
queryexecutionid=$(aws athena start-query-execution --region $AWS_REGION --query-string "$athenatablequery" --result-configuration "OutputLocation=$ATHENA_S3_PATH" | jq -r .QueryExecutionId)
debug 'got the queryexecution id [' $queryexecutionid ']'
querystatusstate='QUEUED'
getqueryexecutionstring="aws athena get-query-execution --query-execution-id $queryexecutionid --region=$AWS_REGION"
while [ "$querystatusstate" = "RUNNING" -o "$querystatusstate" = "QUEUED" ]
do
    debug 'command line is [' $getqueryexecutionstring ']'
    qsuerystatusjson=$($getqueryexecutionstring)
    debug 'qsuerystatusjson is [' $qsuerystatusjson ']'
    querystatusstate=$(echo $qsuerystatusjson | jq -r .QueryExecution.Status.State)
    querystatuschangereason=$(echo $qsuerystatusjson | jq -r .QueryExecution.Status.StateChangeReason)
    
    debug 'querystatusstate value is [' $querystatusstate ']'
    debug 'querystatuschangereason value is [' $querystatuschangereason ']'
done
if [ "$querystatusstate" = 'SUCCEEDED' ]; then
    info 'Table created!'
else 
    if [[ $querystatuschangereason =~ .*AlreadyExistsException.* ]]; then # using a BASH only feature with regexp
        info '-> The table already exists' && exit 4
    else
        error 'unexpected result [' $querystatusstate '] fail' && exit 3
    fi
fi
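Since this polling loop shows up in every script, I sometimes wrap it in a small function. Here is a minimal sketch; the function name, the 1-second pause and the use of --query instead of jq are my own choices:

wait_for_athena_query () {
    # $1: query execution id; prints the final state (SUCCEEDED, FAILED, ...) on stdout
    local state='QUEUED'
    while [ "$state" = "QUEUED" -o "$state" = "RUNNING" ]
    do
        sleep 1 # do not hammer the API
        state=$(aws athena get-query-execution --query-execution-id "$1" --region $AWS_REGION --query 'QueryExecution.Status.State' --output text)
    done
    echo "$state"
}

# usage: querystatusstate=$(wait_for_athena_query "$queryexecutionid")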

Run a query that returns something (DML)

From the command line

Put the query in the variable sqlquery

I use read with a heredoc so I can write the query as a multi-line variable.

$ read -d '' sqlquery << EOF
> SELECT col1, col2, col3
> FROM test.frank
> EOF

$ echo $sqlquery
SELECT col1, col2, col3 FROM test.frank

Run the query

$ aws athena start-query-execution --region $AWS_REGION --query-string "$sqlquery" --result-configuration "OutputLocation=$ATHENA_S3_PATH"
{
    "QueryExecutionId": "3d2939f8-3378-4f64-ae0f-01be2dbc842c"
}

$ aws athena get-query-execution --query-execution-id "3d2939f8-3378-4f64-ae0f-01be2dbc842c" --region=$AWS_REGION
{
    "QueryExecution": {
        "QueryExecutionId": "3d2939f8-3378-4f64-ae0f-01be2dbc842c",
        "Query": "SELECT col1, col2, col3\nFROM test.frank",
        "StatementType": "DML",
        "ResultConfiguration": {
            "OutputLocation": "s3://<myS3Bucket>/athena/3d2939f8-3378-4f64-ae0f-01be2dbc842c.csv"
        },
        "QueryExecutionContext": {},
        "Status": {
            "State": "SUCCEEDED",
            "SubmissionDateTime": "2021-02-11T11:31:00.966000+00:00",
            "CompletionDateTime": "2021-02-11T11:31:03.875000+00:00"
        },
        "Statistics": {
            "EngineExecutionTimeInMillis": 2559,
            "DataScannedInBytes": 1468006,
            "TotalExecutionTimeInMillis": 2909,
            "QueryQueueTimeInMillis": 235,
            "QueryPlanningTimeInMillis": 488,
            "ServiceProcessingTimeInMillis": 115
        },
        "WorkGroup": "primary"
    }
}

The state is SUCCEEDED, and the results are in s3://<myS3Bucket>/athena/3d2939f8-3378-4f64-ae0f-01be2dbc842c.csv

Getting the data the JSON way

$ aws athena get-query-results --query-execution-id "3d2939f8-3378-4f64-ae0f-01be2dbc842c" --region=$AWS_REGION

Then you can play with jq to extract the data you need, but it is a bit fiddly (see the sketch at the end of this section). Or you can take the OutputLocation from the previous output and copy the results in CSV format with aws s3 cp:

$ aws s3 cp s3://<myS3Bucket>/athena/3d2939f8-3378-4f64-ae0f-01be2dbc842c.csv . 

The file 3d2939f8-3378-4f64-ae0f-01be2dbc842c.csv, now in your folder, contains the data you seek.
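And for completeness, here is the jq way after all. It is only a sketch: it assumes a plain SELECT, it only looks at the first page of results, and it relies on Athena returning the column names as the first row, which the [1:] slice skips:

$ aws athena get-query-results --query-execution-id "3d2939f8-3378-4f64-ae0f-01be2dbc842c" --region $AWS_REGION | jq -r '.ResultSet.Rows[1:][] | [.Data[].VarCharValue] | @csv'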

From a script

querystring="SELECT col1, col2, col3 FROM test.frank;" 
queryexecutionid=$(aws athena start-query-execution --region $AWS_REGION --query-string "$querystring" --result-configuration "OutputLocation=$ATHENA_S3_PATH" | jq -r .QueryExecutionId)
debug 'got the queryexecution id [' $queryexecutionid ']'

querystatusstate='QUEUED'
getqueryexecutionstring="aws athena get-query-execution --query-execution-id $queryexecutionid --region=$AWS_REGION"

while [ "$querystatusstate" = "RUNNING" -o "$querystatusstate" = "QUEUED" ]
do
    #debug 'command line is [' $getqueryexecutionstring ']'
    qsuerystatusjson=$($getqueryexecutionstring)
    #debug 'qsuerystatusjson is [' $qsuerystatusjson ']'
    querystatusstate=$(echo $qsuerystatusjson | jq -r .QueryExecution.Status.State)
    debug 'querystatusstate value is [' $querystatusstate ']'
done

[ "$querystatusstate" != 'SUCCEEDED' ] && error 'unexpected result [' $querystatusstate '] fail' && exit 3
resultlocation=$(aws athena get-query-execution --query-execution-id "$queryexecutionid" --region=$AWS_REGION | jq -r '.QueryExecution.ResultConfiguration.OutputLocation')
[ $? -ne 0 ] && error "Dying" && exit 111
aws s3 cp "$resultlocation" .
localfilename="${queryexecutionid}.csv"
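The results are now in ${queryexecutionid}.csv in the current directory. A quick way to peek at the first rows without the header line (just a sketch, the head and tail counts are arbitrary):

tail -n +2 "$localfilename" | head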