Data FAQs

Where do I find out what data is already on the Platform?

The data engineering team maintain a number of databases on the Platform (curated databases). The best way to find out about these is using the data discovery tool (access to the tool is now governed via GitHub; Analytical Platform users have access by default). The tool can be updated by anyone, so if you find something out about the data that isn’t already documented, please do add to it!

In addition to this users can create their own S3 buckets which may have data useful to other teams, you may have to ask around to see if there is an existing dataset that may suit your needs.

How do I gain access to existing data?

Access to curated databases is granted via the database access repository. Have a read through the guidance on there. If you are finding the process a little tricky, please ask for help in #ask-data-engineering. A data engineer will happily guide you through things.

If you are looking for access to a user created bucket, then the admin of that bucket should be able to grant you access. If you don’t know who the admin is, or they are not able to grant you access, then ask in the #analytical-platform-support Slack channel or via GitHub. If the bucket admin is unavailable, the Analytical Platform team will need to receive approval from your line manager before you can be given any access to the bucket.

Where should I store my own data?

Data should be stored in an s3 bucket. You can create a new s3 bucket in the control panel. Data can be uploaded manually via the AWS console (which can be accessed through the control panel) or you can write it from RStudio or JupyterLab.

If your data contains anything that could be considered personal information, you must follow guidance from the data protection team which can be found on the intranet.

How do I read/write data from an s3 bucket?

Python/JupyterLab: You can read/write directly from s3 using pandas. However, to get the best representation of the column types in the resulting Pandas dataframe(s), you may wish to use mojap-arrow-pd-parser.

R/Rstudio: Whilst initially the recommended package was botor, you are also encouraged to try out Rs3tools a community maintained, R-native version of S3tools that removes some of the complexity around using Python.

How do I query a database on the Platform?

Databases on the AP use Amazon Athena which allow you to query data using SQL. You shouldn’t need to know about Athena in detail to query databases on the AP, but if you are interested you may wish to read more about it. There are three ways you can query data (there is more detail on all three of these in Data section of this guidance):

The Amazon Athena workbench: If you log into the AWS console and click Services -> Athena, you’ll see the Athena workbench. This is good for testing your queries.

If you get assumed-role/... is not authorized to perform: glue:GetDatabases, request database access.

Python/JupyterLab: To run queries and/or read data into a pandas DataFrame, use pydbtools. More details are here. Remember to install the latest version!

R/RStudio: There is currently no single recommended package for querying databases in R. There is dbtools which should work on the “old” platform. Rdbtools should work on the “new” platform, but is not officially supported so should be used with caution.

I am running into memory issues, what should I do?

You should do as much data manipulation in Athena/SQL as you possibly can, before reading into your analytical tool of choice. If you are using the curated databases consider filtering out unnecessary rows or columns, or aggregating results if appropriate. The function create_temp_table in pydbtools is particularly useful for helping with this.

If the data is stored in your own s3 bucket, you may wish to create your own Athena database.

How do I create my own Athena database?

Athena workbench/JupyterLab/RStudio: You can run CREATE DATABASE and CREATE TABLE AS SELECT (CTAS) queries to create your own database and tables from data you have in s3. There are more details in this guidance or you can use what is provided by AWS. When running CTAS queries a key thing to remember is to specify the location (s3 bucket and path) of the data. There is a nice example here of setting up your own database. The tutorial is in python but the SQL can be ran from any tool on the AP.

This page was last reviewed on 20 January 2022. It needs to be reviewed again on 20 January 2023 by the page owner #ask-data-engineering .

This page was set to be reviewed before 20 January 2023 by the page owner #ask-data-engineering. This might mean the content is out of date.