Because nearly every business is using big data to power its decision-making and business processes, the demand for data scientists is growing rapidly. According to GetEducated.com, “the number of data scientists doubled over the last four years, and some even quote the growth at 300%.” Companies understand how valuable customer data is and are relying on data scientists to maximize that data’s value. We think it is each data scientist’s responsibility to do their job with their company’s goals in mind, but also with respect for the individual user’s privacy.
Data Scientists and Privacy
Data scientists obviously need data to study, but they should also make user privacy a priority to protect their customers from data breaches and other security threats.
“Privacy is a data management problem with a business process wrapped around it, which culminates in a data governance strategy for an organization.” –Data Privacy & Data Science: The Next Generation of Data Experimentation
If personally identifiable information is not absolutely necessary for understanding a given data point, a good data scientist should remove that information from the company’s database. It’s risky for an organization to store people’s information in a database in plain text, especially if it includes personal details such as names or financial records.
Obfuscate Personal Information: Obfuscation is like ‘disguising’ the data in a database. Rather than seeing an email address as ‘user@example.com’, in obfuscated form it may appear as ‘XXXX@example.com’. Obfuscating data helps to protect individual user privacy, while still giving data scientists the capability to analyze the data. Databases can be configured to give certain users access to only obfuscated or masked data, while giving administrators access to the raw or real data.
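As a rough illustration of the email example above, here is a minimal Python sketch. The function name `mask_email` and the fixed `XXXX` mask are assumptions made for this example, not part of any particular database product; real systems usually apply masking inside the database or in a dedicated data-access layer.

```python
def mask_email(email: str, mask: str = "XXXX") -> str:
    """Replace the local part of an email address with a fixed mask.

    Hypothetical helper for illustration only.
    """
    # Split on the last "@" so the domain stays available for analysis.
    local, sep, domain = email.rpartition("@")
    if not sep or not local or not domain:
        # Not a recognizable address: mask the whole value rather than leak it.
        return mask
    return f"{mask}@{domain}"


print(mask_email("user@example.com"))  # XXXX@example.com
```

Keeping the domain means an analyst can still count, for instance, how many users sign up with a given email provider, without seeing who those users are.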
Read More: What is Obfuscation? – Hackernoon
If a company uses SQL Server or Azure SQL Database, both have a built-in feature that limits exposure of sensitive data fields: Dynamic Data Masking (DDM). DDM lets administrators designate particular columns and define how the data in those columns appears when queried. This doesn’t change the data stored in the database; it only changes the query output, which remains useful for analysis while hiding key personal details.
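The idea behind masking at query time can be sketched in plain Python. This is only a conceptual model under stated assumptions, not SQL Server's actual DDM syntax or API: the table contents, the `MASKS` mapping, and the `query` function with its `can_unmask` flag are all hypothetical names invented for this example.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stored rows; in a real system these would live in the database.
ROWS: List[Dict[str, Any]] = [
    {"id": 1, "name": "Jane Doe", "card": "4111111111111111", "plan": "pro"},
    {"id": 2, "name": "John Roe", "card": "5500000000000004", "plan": "free"},
]


def mask_all(value: str) -> str:
    """Hide the entire value."""
    return "XXXX"


def mask_keep_last4(value: str) -> str:
    """Hide everything except the last four characters."""
    if len(value) <= 4:
        # Too short to reveal anything safely: hide it all.
        return "X" * len(value)
    return "X" * (len(value) - 4) + value[-4:]


# Which columns are masked, and how (an assumption for this sketch).
MASKS: Dict[str, Callable[[str], str]] = {
    "name": mask_all,
    "card": mask_keep_last4,
}


def query(rows: List[Dict[str, Any]], can_unmask: bool) -> List[Dict[str, Any]]:
    """Return copies of the rows, masked unless the caller may see raw data."""
    if can_unmask:
        return [dict(row) for row in rows]
    return [
        {col: MASKS[col](val) if col in MASKS else val for col, val in row.items()}
        for row in rows
    ]


analyst_view = query(ROWS, can_unmask=False)
admin_view = query(ROWS, can_unmask=True)
print(analyst_view[0])
print(admin_view[0])
```

The stored rows are never modified; only the output returned to a non-privileged caller is masked, which mirrors the behavior described above.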
What Do Data Scientists Do?
Data scientists comb through large amounts of data and draw conclusions that help companies make smarter business decisions. They also play a role in optimizing the data mining techniques that companies use to get information about customers and business processes.
Specific tasks for Data Scientists include:
- Identifying the data-analytics problems that offer the greatest opportunities to the organization
- Determining the correct data sets and variables
- Collecting large sets of structured and unstructured data from different sources
- Cleaning and validating the data to ensure accuracy, completeness, and uniformity
- Devising and applying models and algorithms to mine the stores of big data
- Analyzing the data to identify patterns and trends
- Interpreting the data to discover solutions and opportunities
- Communicating findings to stakeholders using visualization and other means
The Beauty of Data Visualization – David McCandless
Data scientists report their findings in a simple and easy-to-understand format. Privacy comes into play here as well. If a report includes personally identifying information about individual users, it is not privacy-friendly, because anyone viewing the presentation gains access to that information. Data scientists must find ways to present their findings without compromising user privacy.