
MLSteal

Source Code

Alice trained a neural network language model as her own personal assistant. Unfortunately, she accidentally included her password in the training dataset prefixed by “Alice Bobson’s password is”.

What’s Alice’s password?

#!/usr/bin/python3

import sys
import gzip
import pickle
import numpy as np

# Import our fancy neural network
from mlsteal import sample


if __name__ == "__main__":
    # Load the model weights
    weights = np.frombuffer(gzip.open("weights.gz","rb").read(), dtype=np.float32)

    # Get model prompt from first argument
    if len(sys.argv) == 1:
        print("Please pass one argument containing a string to prompt the language model")
        exit(1)
    prompt = bytes(sys.argv[1],'ascii')
    x = np.array(list(prompt), dtype=np.uint64)

    # Run the neural network on the prompt
    y = sample(x, weights)

    # Print the sampled bytes to stdout
    print(b"".join(np.array(y, dtype=np.uint8)))

Solution

Looking at the source code, there isn't really any attack vector: everything goes through the sample function, which takes our input and produces output based on the trained model. We therefore need to exploit the model itself in some way.

The idea

In machine learning, one of the biggest “flaws” is that any model naturally handles data it was trained on better than previously unseen data. In image classification, this has led to membership inference attacks and information leakage through GANs.

While both attacks (and the papers I linked) deal primarily with images and image classification models, the same idea can be applied to NLP models. In membership inference, the attacker exploits the fact that a higher confidence score most likely means the submitted image is very close to something the victim's model was trained on, so they can iteratively refine the image, keeping the variant with the highest score.

In our case we do not have a confidence score, only the text generated from our input. We can therefore expect that if we feed the model something it was trained on, there is a high likelihood it will continue outputting the rest of what it memorized.
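
To make this concrete, here is a minimal sketch of how repeated sampling can stand in for a confidence score. This is my own illustration, not part of the challenge: the helper name next_char_counts and the sample count of 20 are assumptions, and it assumes (as the transcripts below show) that the model echoes the prompt back at the start of its output. It would be called with the weights loaded as in the source above.

import numpy as np
from collections import Counter

from mlsteal import sample  # the challenge's black-box sampler


def next_char_counts(prompt, weights, n_samples=20):
    # Encode the prompt exactly as the challenge's query script does.
    x = np.array(list(bytes(prompt, "ascii")), dtype=np.uint64)
    counts = Counter()
    for _ in range(n_samples):
        out = bytes(np.array(sample(x, weights), dtype=np.uint8))
        tail = out[len(prompt):]       # the output echoes the prompt
        if tail:
            counts[chr(tail[0])] += 1  # tally the first new byte
    return counts

A memorized continuation should dominate the tally, playing the role of the confidence score from the image-classification setting.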

The Payloads

prefixed by “Alice Bobson’s password is”.

As part of the challenge introduction, we are given this line. We can also guess that her password, being the flag, starts with CTF{.

While any single query isn't guaranteed to leak anything, we can keep feeding the model the same prompt, Alice Bobson's password is CTF{, until we get new output. Using this technique, I recovered CTF{m3m0r1z4t10n-1S-4LL

root@57f64727c54a:/ctf# python3 query.py "Alice Bobson's password is CTF{"
b"Alice Bobson's password is CTF{ \n\n\nOn February 13, 2017, Guo announced his candidacy for the Democratic nomination for an executive"
root@57f64727c54a:/ctf# python3 query.py "Alice Bobson's password is CTF{"
b"Alice Bobson's password is CTF{ \n\n<br>\n\n\n\nChurch John Bloomsbury\n\n\n\n\n\n\n\nBobby Bush Arrondissement\n\n\n\nBobby Bush of Bush is a "

---SNIP---

root@57f64727c54a:/ctf# python3 query.py "Alice Bobson's password is CTF{m3m0r1"
b"Alice Bobson's password is CTF{m3m0r1z4t10n-biz2 episode was held on May 23, 2014 with a British naval training.\n\n\n\nBrigham Gregg\n\n\n\nBrig"
root@57f64727c54a:/ctf# python3 query.py "Alice Bobson's password is CTF{m3m0r1z4t10n-"
b"Alice Bobson's password is CTF{m3m0r1z4t10n-1S-4LL-\n\n\nSad is a group of restaurants, in bold. The center of the Bolton Berlin Tree was created a"

I was stuck here for a while without recovering anything new, so I started brute-forcing each possible letter and number one at a time, giving each candidate about 10 tries, as sketched below.
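
The loop below is a minimal sketch of that brute force, reusing the hypothetical query helper from the previous sketch; the character set and the 10 tries per candidate are assumptions on my part.

import string

charset = string.ascii_letters + string.digits + "-_}"
known = "Alice Bobson's password is CTF{m3m0r1z4t10n-1S-4LL-"

for c in charset:
    for _ in range(10):              # ~10 tries per candidate
        tail = query(known + c).split(known + c, 1)[-1]
        if tail[:1] and tail[:1] in charset:
            # The model kept completing flag-like text: a likely hit.
            print(c, repr(tail[:20]))
            break

This surfaced y as the next character, and querying with the extended prompt made the model finish the flag: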

root@57f64727c54a:/ctf# python3 query.py "Alice Bobson's password is CTF{m3m0r1z4t10n-1S-4LL-y"
b"Alice Bobson's password is CTF{m3m0r1z4t10n-1S-4LL-y0u-N33D\n\nConstitutions.\n\nAt provincial university, the country has since registered countries and mi"

Now we have the flag:

CTF{m3m0r1z4t10n-1S-4LL-y0u-N33D}

References

  1. Briland Hitaj, Giuseppe Ateniese, and Fernando Perez-Cruz. Deep Models Under the GAN: Information Leakage from Collaborative Deep Learning. Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS), 2017.
  2. Liwei Song, Reza Shokri, and Prateek Mittal. Membership Inference Attacks Against Adversarially Robust Deep Learning Models. 2019 IEEE Security and Privacy Workshops (SPW), 2019.
This post is licensed under CC BY 4.0 by the author.