Inside the Airbnb data protection platform
Last year, in another data breach at Airbnb, the data of a few hosts, including personal address information and direct messages, was exposed to other hosts in the app. Hosts on the Airbnb subreddit reported that on login they were shown different names and a different inbox, while their co-host saw a second, unrelated inbox. Airbnb described the breach as a technical issue that affected only a small subset of users.
In a blog post, Elizabeth Nammour, Wendy Jin, and Shengpu Liu, software engineers at Airbnb, broke down the platform's automated data protection system. Airbnb's data is stored in MySQL, Hive and S3; it is generated, replicated and propagated daily, and it is tracked in the centralized inventory system that manages data at Airbnb.
The data protection platform
The data protection platform (DPP) is automated to understand the data and enable its protection. Where the software cannot do this itself, it notifies the team so that the work can be done manually. Automation primarily focuses on three areas: data discovery, prevention of sensitive data leaks, and data encryption.
The first step is to discover the personal data that has to be secured. DPP automatically notifies data owners when it detects their data in data stores, so that the data can be protected and deleted or returned when required. Data breaches typically occur when API keys or credentials are leaked internally, for example when an engineer commits a secret to the codebase. DPP acts preventatively here: it identifies potential leaks, notifies the engineer to remove the secret from the code and rotate it, and then has the new secret hidden using the encryption tool sets. Finally, encryption ensures that infiltrators cannot read sensitive data even if they reach it. DPP's encryption service works from automatically discovered sensitive data instead of relying on manual identification.
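To make the leak-prevention step concrete, here is a minimal sketch of a secret scanner. It is not how Airbnb's pipeline works (the source does not describe its detection rules); it only assumes the widely documented shape of an AWS access key ID (`AKIA` followed by 16 uppercase letters or digits). The names `scan_file` and `Finding` are hypothetical.

```python
import re
from typing import List, NamedTuple

# Assumption: we only look for AWS access key IDs, which follow the
# well-known pattern "AKIA" + 16 uppercase alphanumeric characters.
AWS_ACCESS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")


class Finding(NamedTuple):
    path: str       # file in which the candidate secret was found
    line_no: int    # 1-based line number
    redacted: str   # secret with everything after the prefix masked


def scan_file(path: str, text: str) -> List[Finding]:
    """Return one Finding per candidate secret in the given file content."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in AWS_ACCESS_KEY_ID.finditer(line):
            secret = match.group(1)
            # Never store or log the raw secret; keep only the prefix.
            masked = secret[:4] + "*" * (len(secret) - 4)
            findings.append(Finding(path, line_no, masked))
    return findings
```

A real pipeline would then notify the owning engineer for each finding; the sketch stops at detection and redaction.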
Let’s take a look at the different components that make up DPP.
Source: Elizabeth Nammour medium publication
The data classification service is called Inspekt. It continuously scans Airbnb's data stores for sensitive and personal data. Angmar is the secret detection pipeline for the codebase, and Cipher is the data encryption service, which gives Airbnb developers a transparent framework for protecting sensitive data. Privacy requests are handled by the Obliviate orchestration service, while Minister is the third-party risk and privacy service. Madoka is the metadata service that collects the security and privacy properties of data assets from various sources. Finally, the presentation layer is the Data Protection Service, which defines the jobs that enable the automation of data protection.
One of the essential layers of data protection, Madoka is a metadata system that maintains security- and privacy-related metadata across all data assets on the Airbnb platform. It gives engineers and other internal stakeholders a centralized repository in which to track and manage the metadata of their data assets.
Madoka's core metadata covers the list of data assets, their ownership and their data classification, for both MySQL and S3.
Madoka takes care of three essential functions: collecting metadata, storing metadata and providing metadata to other services. Two main components carry these out: a crawler and a backend. The Madoka crawler is a daily crawling service that fetches metadata from other data sources and publishes it to an AWS Simple Queue Service (SQS) queue. The Madoka backend is a data service that ingests this metadata from SQS, reconciles any conflicting information, and stores the metadata in its database.
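The crawler and backend split can be sketched as a small producer and consumer. This is an illustration under stated assumptions, not Madoka's code: a standard-library `queue.Queue` stands in for the SQS queue, the class and field names are hypothetical, and the reconciliation rule (the most recently observed value wins) is an assumption, since the source does not say how conflicts are resolved.

```python
import json
import queue
from dataclasses import asdict, dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class MetadataEvent:
    asset: str        # identifier of the data asset (hypothetical format)
    prop: str         # metadata property, e.g. "owner_team"
    value: str        # value observed by the crawler
    source: str       # where the crawler found the value
    observed_at: int  # crawl time as epoch seconds


class Crawler:
    """Publishes crawled metadata as JSON messages (stand-in for SQS)."""

    def __init__(self, out_queue: "queue.Queue[str]") -> None:
        self.out_queue = out_queue

    def publish(self, events: Iterable[MetadataEvent]) -> None:
        for event in events:
            self.out_queue.put(json.dumps(asdict(event)))


class Backend:
    """Ingests messages, reconciles conflicts and stores the result."""

    def __init__(self) -> None:
        self.store: Dict[Tuple[str, str], MetadataEvent] = {}

    def ingest(self, in_queue: "queue.Queue[str]") -> None:
        while True:
            try:
                body = in_queue.get_nowait()
            except queue.Empty:
                break
            event = MetadataEvent(**json.loads(body))
            key = (event.asset, event.prop)
            current = self.store.get(key)
            # Assumed rule: keep the most recently observed value.
            if current is None or event.observed_at >= current.observed_at:
                self.store[key] = event

    def get(self, asset: str, prop: str) -> Optional[str]:
        event = self.store.get((asset, prop))
        return event.value if event else None
```

Putting a queue between the two components means the crawler can run on its own daily schedule without knowing how, or how quickly, the backend stores what it receives.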
For MySQL, the crawler collects the list of all columns by calling the AWS APIs to get the list of all clusters in the environment and the reader endpoint of each cluster. It then uses JDBI to connect to each cluster through that endpoint and list all databases, tables, columns, and column data types. It collects this data and transmits it to the Madoka backend for storage. For S3, AWS resources are configured in code with Terraform, and the crawler parses that configuration to retrieve S3 metadata. To retrieve the objects themselves, the crawler uses S3 inventory reports, which are enabled across all production S3 buckets in Terraform. The crawler collects information such as account number, account name, bucket name, and assumed role name, and passes it to the backend for storage.
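As an illustration of the S3 side, the sketch below turns one CSV inventory report into asset records. It assumes the documented layout of S3 inventory output: the CSV data has no header row, the column order is given by the `fileSchema` field of the accompanying `manifest.json`, and object keys are URL-encoded. The function name, the record shape and the `account_name` argument are hypothetical; the source does not describe the crawler at this level of detail.

```python
import csv
import io
import json
from typing import Dict, List, Optional, Union
from urllib.parse import unquote

Record = Dict[str, Union[str, int, None]]


def parse_inventory(manifest_json: str, csv_text: str,
                    account_name: str) -> List[Record]:
    """Convert one S3 inventory CSV file into a list of asset records."""
    manifest = json.loads(manifest_json)
    # The manifest lists the CSV columns, e.g. "Bucket, Key, Size".
    columns = [name.strip() for name in manifest["fileSchema"].split(",")]

    records: List[Record] = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        fields = dict(zip(columns, row))
        size: Optional[int] = int(fields["Size"]) if fields.get("Size") else None
        records.append({
            "account_name": account_name,    # supplied by the caller
            "bucket": fields["Bucket"],
            "key": unquote(fields["Key"]),   # keys are URL-encoded in CSV
            "size_bytes": size,
        })
    return records
```

Reading inventory reports instead of listing each bucket through the API keeps a daily crawl cheap, because the report is produced by AWS on a schedule and read as a plain file.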
The team uses a metadata property, ownership, to describe who owns a specific data asset. Service ownership lets the team link a data asset to a specific codebase and so automate protection actions that require code changes. The team also records team ownership rather than user ownership, because team members change while the data asset stays with the team.
The data classification metadata property describes the types of data elements stored in the asset. It lets the team understand the risk associated with each data asset and so determine the level of protection it needs.
The crawler retrieves data classifications from Airbnb's Git repositories and from the automated data classification tool, Inspekt. The output lists the data elements found in each asset, so that classifications are monitored continuously and kept up to date as the data changes.
The team designed Madoka to be easily extensible, so that it can keep collecting and storing more attributes related to security and privacy. The Airbnb team has gone to considerable lengths to ensure better data protection and security.
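The ownership and classification properties described above can be pictured as fields on one asset record. The sketch below is only an illustration: the field names, the data element names, the three risk levels and the rule that an asset takes the risk of its most sensitive element are all assumptions, not Airbnb's actual schema or policy.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Hypothetical risk levels, ordered from least to most sensitive.
RISK_ORDER = ["low", "medium", "high"]

# Hypothetical mapping from data element to risk level.
ELEMENT_RISK = {"city": "low", "email": "medium", "payment_card": "high"}


@dataclass
class DataAsset:
    name: str
    owner_team: str      # team ownership, not an individual user
    owner_service: str   # links the asset to a codebase
    data_elements: Set[str] = field(default_factory=set)


def risk_level(asset: DataAsset) -> str:
    """Assumed rule: an asset is as risky as its most sensitive element."""
    if not asset.data_elements:
        return "low"
    # Unknown elements are treated as high risk (a cautious assumption).
    levels = [ELEMENT_RISK.get(e, "high") for e in asset.data_elements]
    return max(levels, key=RISK_ORDER.index)


def notify_targets(asset: DataAsset, roster: Dict[str, List[str]]) -> List[str]:
    """Resolve the current members of the owning team at notification time."""
    return sorted(roster.get(asset.owner_team, []))
```

Because the asset stores a team name and membership is looked up only when a notification is sent, a change in the team's roster needs no update to the asset record.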