Amazon Athena uses data source connectors that run on AWS Lambda to execute federated queries. A data source connector is a piece of code that can translate between your target data source and Amazon Athena.
Before you can start querying Amazon DocumentDB with Athena, complete the following steps:
From the Amazon Athena console go to Data Sources.
Select Connect data source.
On the choose where your data is located section, make the following selections:
On the “Connection details: Amazon DocumentDB” section, choose Configure new AWS Lambda function. A new tab will open in the AWS Lambda console.
On the new page, go to Application settings section and follow the below instructions:
AthenaCatalogName: Enter docdb.
DocDBConnectionString: Retrieve your cluster connection string by opening a new browser tab to your Amazon DocumentDB Cluster page, clicking on the Cluster identifier link that has the Primary instance Role, and copy the connection string (under Connect to this instance with an application). Replace the password, accordingly.
SecurityGroupIds: From the Cloud9 environment, run the following command to get the DocumentDB Security Group:
jq -r '. | select(.OutputKey == "DocDBSgId") | .OutputValue' jsonData.json
jq -r '. | select(.OutputKey == "PrivateSubnets") | .OutputValue' jsonData.json
Copy the output of the command, go back to the AWS Lambda console tab, and fill the Subnets Ids.
Leave everything else as default.
Mark I acknowledge that this app creates custom IAM roles and choose Deploy.
This will deploy Athena DocumentDB connector and you can refer to this AWS Lambda function in your queries as “lambda:docdb”. You can close current tab.
Get back to Athena console tab and select the lambda you just created. For Catalog name use docdb and choose Connect. If you do not see the Lambda function gets listed, choose the refresh icon to update the list of functions.
Get back to Query editor section, choose docdb as the Data source on the left and run the following query:
SELECT * FROM "lambda:docdb".sampledb.tweets LIMIT 10;
You will see the data that resides in your Amazon DocumentDB cluster.
To compare the number of rows in both both sources, run the following queries:
For your Amazon DocumentDB cluster:
SELECT count(*) FROM "lambda:docdb".sampledb.tweets;
For your s3 bucket:
SELECT count(*) FROM "change-streams"."sampledb";