A Rosette Cloud script enables you to hide PII in your documents and data
Organizations often need to share documents and information that may include personally identifiable information (PII), and they must protect it, whether out of good conscience or by legal mandate. Going through documents manually to identify and remove all potentially compromising data is time-consuming, expensive, and inefficient, but failing to do so can lead to dire financial and organizational repercussions.
To help organizations protect against damaging leaks, Rosette Cloud now enables identity masking, a way to redact personally identifiable information from your documents so that they can be safely stored and shared.
For example, if you pipe in the string:
$ echo 'John Smith is accused of stealing $1,000,000.' | ./mask_identities.py
The following is returned:
PERSON1 is accused of stealing IDENTIFIER:MONEY.
Rosette first extracts the entities in your text data, then the script enables you to hide people, locations, organizations, products, titles, nationalities, religions, and more from your documents so that the individuals or organizations involved cannot be identified.
Protect against damaging data leaks
Another day, another data breach. As we increasingly live our lives online, trusting dozens of websites with credit cards, social security numbers, bank accounts, phone numbers, and even seemingly innocuous information like a beloved pet’s name or our alma mater, we also put ourselves at risk.
On a corporate level, companies and organizations that users entrust with personally identifying information have an obligation to protect that data. Not only does data masking protect your customers, but it also limits the damage that can be done when an inevitable breach does happen. Target paid $18.5 million to settle with customers whose credit card info was compromised, and Ashley Madison will be paying $11.2 million to its hacking victims. Those costs don’t factor in lost sales to now-wary customers or the hit both companies took to their reputations.
If Target, Ashley Madison, and dozens of other companies that have been hacked had masked the most compromising bits of their data, millions of dollars could have been saved.
Share documents publicly without violating privacy
With so many diverse organizations handling sensitive information, identity masking can benefit dozens of use cases, including:
- Police departments publishing crime reports
- Law firms exchanging documents in the e-discovery process
- Banks tracking customer assets
- Market intelligence researchers studying consumer trends
- US Census Bureau analysts
Especially in circumstances where private data redaction is obligatory, quickly and cost-effectively managing sensitive information is a must.
Identity masker for Rosette Cloud
Visit our GitHub repository to find instructions and sample Python code demonstrating how to use Rosette Cloud entity extraction results to mask personally identifying information in text.
You can use the script from the command line as follows:
$ ./mask_identities.py -h
usage: mask_identities.py [-h] [-i INPUT] [-u] [-k KEY] [-a API_URL]
                          [-l LANGUAGE] [-t TYPE [TYPE ...]]

  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Path to a file containing input data (if not
                        specified data is read from stdin) (default: None)
  -u, --content-uri     Specify that the input is a URI (otherwise load text
                        from file) (default: False)
  -k KEY, --key KEY     Rosette API Key (default: None)
  -a API_URL, --api-url API_URL
                        Alternative Rosette API URL (default:
                        https://api.rosette.com/rest/v1/)
  -l LANGUAGE, --language LANGUAGE
                        A three-letter (ISO 639-2 T) code that will override
                        automatic language detection (default: None)
  -t TYPE [TYPE ...], --entity-types TYPE [TYPE ...]
                        A list of named entity types to mask (refer to
                        https://developer.rosette.com/features-and-functions#entity-extraction-entity-types
                        for a full description of supported entity types)
                        (default: ['ORGANIZATION', 'PERSON',
                        'IDENTIFIER:CREDIT_CARD_NUM', 'IDENTIFIER:EMAIL',
                        'IDENTIFIER:MONEY', 'IDENTIFIER:PERSONAL_ID_NUM',
                        'IDENTIFIER:PHONE_NUMBER', 'TEMPORAL:DATE',
                        'TEMPORAL:TIME', 'IDENTIFIER:LATITUDE_LONGITUDE'])
Note that this script is for demonstration purposes only. You should NOT assume that the results are perfect or that all personally identifying information has been removed. Rosette’s entity extraction uses statistical machine-learned models, so it would not be prudent to trust it wholesale. Rather, entity extraction and masking with Rosette is a valuable first step that is best followed by human review.
If you prefer not to enter your Rosette Cloud API key every time you run the script, you can set the environment variable $ROSETTE_USER_KEY instead.
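One plausible way to wire up that fallback is sketched below. This is an illustration, not the script's actual source: the helper names build_parser and resolve_key are hypothetical, and the only details taken from the help text above are the -k/--key option and the ROSETTE_USER_KEY variable name. The environment is passed in as a plain mapping so the behavior is easy to check.

```python
import argparse
import os


def build_parser(env):
    """Build a parser whose --key default comes from the environment.

    `env` is any mapping of environment variables (normally os.environ).
    """
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    parser.add_argument(
        "-k", "--key",
        default=env.get("ROSETTE_USER_KEY"),  # None if the variable is unset
        help="Rosette API Key",
    )
    return parser


def resolve_key(argv, env):
    """Return the API key from the command line, falling back to env."""
    args = build_parser(env).parse_args(argv)
    if args.key is None:
        raise SystemExit("No API key: pass -k/--key or set ROSETTE_USER_KEY")
    return args.key


if __name__ == "__main__":
    # Prefix only, so the key itself is not echoed to the terminal.
    print(resolve_key(None, os.environ)[:4] + "...")
```

With this arrangement, a key given with -k takes precedence over the environment variable, which is only consulted when the option is omitted.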
If there are multiple mentions of certain entity types (PERSON in the example below), they are indexed so that distinct entities can still be distinguished, even if they can’t be identified:
$ echo "John Smith is accused of stealing \$1,000,000. Jane Smith was John's accomplice." | ./mask_identities.py
PERSON1 is accused of stealing IDENTIFIER:MONEY. PERSON2 was PERSON1's accomplice.
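The indexing behavior shown above can be approximated with a few lines of Python. This is a minimal sketch under stated assumptions, not the script's implementation: it assumes each mention is available as a (start, end, entity_type, entity_id) tuple, that mentions of the same entity share an entity_id, and that only the types in the hypothetical INDEXED_TYPES set receive a numeric suffix. The real Rosette response format and the real set of indexed types may differ.

```python
from collections import defaultdict

# Assumption for illustration: only these types get a numeric index.
INDEXED_TYPES = {"PERSON"}


def span(text, needle, start=0):
    """Return (start, end) offsets of the first `needle` at or after `start`."""
    i = text.index(needle, start)
    return i, i + len(needle)


def mask_entities(text, mentions, indexed_types=INDEXED_TYPES):
    """Replace each mention with its type, numbering distinct entities.

    `mentions` is a list of (start, end, entity_type, entity_id) tuples
    with non-overlapping character offsets into `text`.
    """
    labels = {}                  # (type, entity_id) -> replacement label
    counters = defaultdict(int)  # type -> how many distinct entities so far

    # Assign labels in document order so the first entity seen is number 1.
    for start, end, etype, eid in sorted(mentions):
        key = (etype, eid)
        if key not in labels:
            if etype in indexed_types:
                counters[etype] += 1
                labels[key] = f"{etype}{counters[etype]}"
            else:
                labels[key] = etype

    # Replace from right to left so earlier offsets stay valid.
    masked = text
    for start, end, etype, eid in sorted(mentions, reverse=True):
        masked = masked[:start] + labels[(etype, eid)] + masked[end:]
    return masked


text = ("John Smith is accused of stealing $1,000,000. "
        "Jane Smith was John's accomplice.")
mentions = [
    (*span(text, "John Smith"), "PERSON", "p1"),
    (*span(text, "$1,000,000"), "IDENTIFIER:MONEY", "m1"),
    (*span(text, "Jane Smith"), "PERSON", "p2"),
    (*span(text, "John", 1), "PERSON", "p1"),  # second mention of p1
]
print(mask_entities(text, mentions))
```

Replacing from the end of the string backwards is the key design choice here: because labels rarely have the same length as the text they replace, working left to right would shift every later offset.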
Ready to start protecting the identities in your data?
First, make sure you have a Rosette Cloud key. It’s free for a 30-day trial, and you don’t need to enter a credit card to sign up. Next, head over to our community GitHub for a detailed walkthrough and to download the script.