Where data owners and data scientists can securely collaborate without exposing data – opening the way to projects that were too risky to consider.
What is BastionLab?
BastionLab is a simple privacy framework for data science collaboration, covering data exploration and AI training.
It works as an access control layer: data owners use it to protect the privacy of their datasets, and it acts as a guard that allows only privacy-friendly operations on the data and shows only anonymized outputs to the data scientist.
Data owners can let external or internal data scientists explore and extract value from their datasets, according to a strict privacy policy they define in BastionLab.
Data scientists can remotely run queries on data frames and train their models without seeing the original data or intermediary results.
BastionLab is an open-source project.
Our solution is coded in Rust and uses Polars, a pandas-like library for data exploration, and Torch, a popular library for AI training.
We also offer an option to set up confidential computing, a hardware-based technology that ensures no one but the processor of the machine can see the data or the model.
Quick tour
You can try out our Quick tour in the documentation to discover BastionLab with a hands-on example using the famous Titanic dataset.
But here’s a taste of what using BastionLab could look like:
Data exploration
Data owner’s side
# `df` is assumed to be a Polars DataFrame already loaded by the data owner, such as the Titanic dataset.

# Define a custom policy for your data.
# In this example, requests that aggregate at least 10 rows are safe.
# Other requests will be reviewed by the data owner.
>>> from bastionlab.polars.policy import Policy, Aggregation, Review
>>> policy = Policy(safe_zone=Aggregation(min_agg_size=10), unsafe_handling=Review())

# Upload your dataset to the server.
# Optionally anonymize sensitive columns.
# The server returns a remote object that can be used to query the dataset.
>>> from bastionlab import Connection
>>> with Connection("bastionlab.example.com") as client:
...     rdf = client.polars.send_df(df, policy=policy, sanitized_columns=["Name"])
...     rdf
...
FetchableLazyFrame(identifier=3a2d15c5-9f9d-4ced-9234-d9465050edb1)
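To make the policy above more concrete, here is a minimal sketch of the decision it describes: a result is treated as safe when every output row aggregates at least `min_agg_size` input rows, and is sent for review otherwise. `ToyAggregationPolicy` and `decide` are hypothetical names used only for illustration; they are not part of BastionLab's API, and the real server presumably inspects the query itself rather than a list of group sizes.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ToyAggregationPolicy:
    """Hypothetical stand-in for a 'safe zone' rule; not BastionLab's class."""

    min_agg_size: int

    def decide(self, rows_per_output_row: Sequence[int]) -> str:
        """Return "safe" when every output row aggregates enough input rows,
        otherwise "review" (the data owner has to approve the request)."""
        if rows_per_output_row and all(
            n >= self.min_agg_size for n in rows_per_output_row
        ):
            return "safe"
        return "review"


policy = ToyAggregationPolicy(min_agg_size=10)

# head(5) returns five rows, each derived from a single input row.
head_decision = policy.decide([1, 1, 1, 1, 1])

# Grouping the Titanic training set by Pclass aggregates 216, 184 and 491 rows.
groupby_decision = policy.decide([216, 184, 491])

print(head_decision, groupby_decision)
```

Under these assumptions the `head(5)` request falls outside the safe zone, while the per-class aggregate stays inside it.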
Data scientist’s side
# `connection` is assumed to be a Connection already opened to the BastionLab server, and `pl` the imported polars module.
>>> all_remote_dfs = connection.client.polars.list_dfs()
>>> remote_df = all_remote_dfs[0]

# Run unsafe queries such as displaying the first five rows.
# According to the policy, unsafe queries require the data owner's approval.
>>> remote_df.head(5).collect().fetch()
Warning: non privacy-preserving queries necessitate data owner's approval.
Reason: Only 1 subrules matched but at least 2 are required.
Failed subrules are:
Rule #1: Cannot fetch a DataFrame that does not aggregate at least 10 rows of the initial dataframe uploaded by the data owner.

A notification has been sent to the data owner. The request will be pending until the data owner accepts or denies it or until timeout seconds elapse.
The query has been accepted by the data owner.
shape: (5, 12)
┌─────────────┬──────────┬────────┬──────┬─────┬──────────────────┬─────────┬───────┬──────────┐
│ PassengerId │ Survived │ Pclass │ Name │ ... │ Ticket           │ Fare    │ Cabin │ Embarked │
│ ---         │ ---      │ ---    │ ---  │     │ ---              │ ---     │ ---   │ ---      │
│ i64         │ i64      │ i64    │ str  │     │ str              │ f64     │ str   │ str      │
╞═════════════╪══════════╪════════╪══════╪═════╪══════════════════╪═════════╪═══════╪══════════╡
│ 1           │ 0        │ 3      │ null │ ... │ A/5 21171        │ 7.25    │ null  │ S        │
├─────────────┼──────────┼────────┼──────┼─────┼──────────────────┼─────────┼───────┼──────────┤
│ 2           │ 1        │ 1      │ null │ ... │ PC 17599         │ 71.2833 │ C85   │ C        │
├─────────────┼──────────┼────────┼──────┼─────┼──────────────────┼─────────┼───────┼──────────┤
│ 3           │ 1        │ 3      │ null │ ... │ STON/O2. 3101282 │ 7.925   │ null  │ S        │
├─────────────┼──────────┼────────┼──────┼─────┼──────────────────┼─────────┼───────┼──────────┤
│ 4           │ 1        │ 1      │ null │ ... │ 113803           │ 53.1    │ C123  │ S        │
├─────────────┼──────────┼────────┼──────┼─────┼──────────────────┼─────────┼───────┼──────────┤
│ 5           │ 0        │ 3      │ null │ ... │ 373450           │ 8.05    │ null  │ S        │
└─────────────┴──────────┴────────┴──────┴─────┴──────────────────┴─────────┴───────┴──────────┘
# Run safe queries and get the result right away.
>>> (
...     remote_df
...     .select([pl.col("Pclass"), pl.col("Survived")])
...     .groupby(pl.col("Pclass"))
...     .agg(pl.col("Survived").mean())
...     .sort("Survived", reverse=True)
...     .collect()
...     .fetch()
... )
shape: (3, 2)
┌────────┬──────────┐
│ Pclass │ Survived │
│ ---    │ ---      │
│ i64    │ f64      │
╞════════╪══════════╡
│ 1      │ 0.62963  │
├────────┼──────────┤
│ 2      │ 0.472826 │
├────────┼──────────┤
│ 3      │ 0.242363 │
└────────┴──────────┘
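In the first result above, the Name column comes back as null because the data owner listed it in `sanitized_columns` when uploading the dataset. A rough sketch of that idea in plain Python follows. `sanitize_rows` is a hypothetical helper written for this illustration, the sample rows are made up, and nothing here reflects how BastionLab implements sanitization internally.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def sanitize_rows(rows: Iterable[Row], sanitized_columns: Iterable[str]) -> List[Row]:
    """Return copies of `rows` with every sanitized column set to None."""
    hidden = set(sanitized_columns)
    return [
        {column: (None if column in hidden else value) for column, value in row.items()}
        for row in rows
    ]


# Made-up rows for illustration; they are not taken from the real dataset.
rows = [
    {"PassengerId": 1, "Survived": 0, "Name": "Passenger A"},
    {"PassengerId": 2, "Survived": 1, "Name": "Passenger B"},
]

clean = sanitize_rows(rows, sanitized_columns=["Name"])
print(clean)
```

The helper builds new dictionaries instead of mutating its input, so the owner's original rows are left untouched.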
AI training
Data owner’s side
# `transform` is assumed to be a torchvision transform defined beforehand.
>>> from torchvision.datasets import CIFAR100
>>> from bastionlab import Connection
>>> train_dataset = CIFAR100("data", train=True, transform=transform, download=True)
Files already downloaded and verified
>>> test_dataset = CIFAR100("data", train=False, transform=transform, download=True)
Files already downloaded and verified

# Send them to the server by instantiating a RemoteDataset.
>>> with Connection("localhost") as client:
...     client.torch.RemoteDataset(train_dataset, test_dataset, name="CIFAR100")
...
Sending CIFAR100: 100%|████████████████████| 615M/615M [00:04<00:00, 150MB/s]
Sending CIFAR100 (test): 100%|████████████████████| 123M/123M [00:00<00:00, 150MB/s]
<bastionlab.torch.learner.RemoteDataset object at 0x7f1220063ac0>
Data scientist’s side
>>> from torchvision.models import efficientnet_b0
>>> from bastionlab import Connection

# Define the model.
>>> model = efficientnet_b0()

# List the datasets made available by the data owner, select one and get a remote object.
>>> connection = Connection("localhost")
>>> remote_datasets = connection.client.torch.list_remote_datasets()
>>> remote_dataset = remote_datasets[0]

# Send the model to the server by instantiating a RemoteLearner.
# The RemoteLearner object references the RemoteDataset.
>>> remote_learner = connection.client.torch.RemoteLearner(
...     model,
...     remote_dataset,
...     max_batch_size=64,
...     loss="cross_entropy",
...     model_name="EfficientNet-B0",
...     device="cpu",
... )
Sending EfficientNet-B0: 100%|████████████████████| 21.7M/21.7M [00:00<00:00, 531MB/s]

# Train the remote model for a given number of epochs.
>>> remote_learner.fit(nb_epochs=1)
Epoch 1/1 - train: 100%|████████████████████| 781/781 [04:06<00:00, 3.17batch/s, cross_entropy=4.1798 (+/-0.0000)]

# Test the remote model.
>>> remote_learner.test(metric="accuracy")
Epoch 1/1 - test: 100%|████████████████████| 156/156 [00:14<00:00, 10.62batch/s, accuracy=0.1123 (+/-0.0000)]
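The batch counts in the two progress bars can be sanity-checked with a little arithmetic. CIFAR-100 ships 50,000 training images and 10,000 test images, and the learner above uses `max_batch_size=64`. The sketch below assumes that only full batches are counted; that is an inference from the numbers shown, not documented BastionLab behavior, and `full_batches` is a helper name made up for this example.

```python
CIFAR100_TRAIN_IMAGES = 50_000
CIFAR100_TEST_IMAGES = 10_000
MAX_BATCH_SIZE = 64


def full_batches(n_samples: int, batch_size: int) -> int:
    """Number of complete batches, ignoring a trailing partial batch."""
    return n_samples // batch_size


train_batches = full_batches(CIFAR100_TRAIN_IMAGES, MAX_BATCH_SIZE)
test_batches = full_batches(CIFAR100_TEST_IMAGES, MAX_BATCH_SIZE)

print(train_batches, test_batches)  # 781 156
```

Both values match the totals in the training and test progress bars.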
Key features
Access control: data owners can define an interactive privacy policy that filters the data scientists' queries. They no longer have to open unrestricted access to their datasets.
Limited expressivity: BastionLab limits the types of operations that data scientists can execute, to avoid arbitrary code execution.
Transparent remote access: the data scientists never access the dataset directly. They only manipulate a local object that contains metadata to interact with a remotely hosted dataset. Calls can always be seen by data owners.
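As a rough illustration of the last two features, the sketch below models a local handle that holds only an identifier and a list of requested operations, and refuses anything outside a fixed allow-list. `ToyRemoteFrame`, `ALLOWED_OPS` and `op` are hypothetical names invented for this example; BastionLab's actual client objects and its set of permitted operations may differ.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical allow-list; the real set of permitted operations is not given here.
ALLOWED_OPS = frozenset({"select", "filter", "groupby", "agg", "sort", "head"})


@dataclass(frozen=True)
class ToyRemoteFrame:
    """Local handle: an identifier plus the operations requested so far.

    It never holds any rows, only metadata describing the remote query.
    """

    identifier: str
    plan: List[Tuple[str, tuple]] = field(default_factory=list)

    def op(self, name: str, *args) -> "ToyRemoteFrame":
        """Append an operation to the plan, rejecting anything not allow-listed."""
        if name not in ALLOWED_OPS:
            raise PermissionError(f"operation {name!r} is not allowed")
        return ToyRemoteFrame(self.identifier, self.plan + [(name, args)])


rdf = ToyRemoteFrame("3a2d15c5")
query = rdf.op("groupby", "Pclass").op("agg", "mean(Survived)")

try:
    rdf.op("apply", "arbitrary_python_function")
    rejected = False
except PermissionError:
    rejected = True

print([name for name, _ in query.plan], rejected)
```

Because every request is recorded as a named step in the plan, a server receiving it could log or review each call, which is the kind of visibility the third feature describes.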
BastionLab is still in development. Do not use it for production workloads yet. We plan to have the solution audited to attest that it meets industry security standards.
License
BastionLab is licensed under the Apache License, Version 2.0.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.