Cracking the Data Deluge: Training a New Generation of Data Scientists

As society becomes increasingly digital, it is generating enormous amounts of data: the electronic footprint of humanity’s actions, thoughts, and feelings. Like any new resource, this data can be harnessed for good or for ill. It is the job of educational institutions to teach people the right way to derive value from data, keeping fairness, privacy, and morality at the centre of all their development processes. Data science has been repeatedly called the “sexiest” new industry of this century. The problem is that there is no standardized, widespread post-secondary degree for data science.

The current pool of data scientists comes from all kinds of quantitative backgrounds: physics, mathematics, engineering, and economics, to name only a few. These data scientists tend to take the programming, problem-solving, and statistics skills learned in their post-secondary degrees and transfer them to whichever industry needs them to solve new challenges with large, complicated, and often unstructured datasets.

Even with a hefty amount of training, there is typically a large disconnect between the expectations of the prospective data scientist and those of the business looking for value. The prospect looking to transition from academia to industry may believe that they will be provided an endless supply of clean data, upon which they can perform some quick modelling and analysis to collect their paycheck. The business, on the other hand, will likely expect a fully functioning tool that it can tap into for on-demand insights from scattered data sources. This misalignment of expectations leads to friction that can be discouraging to both parties. The prospect will be frustrated that their work goes unused, while the business will struggle to derive value from the models and reports provided. How do we bridge the divide between expectations? How do we better prepare our data science prospects for the reality of a fast-paced business environment?

The answer to these questions is to develop teaching resources that will fill this crucial gap in know-how, until well-established data science degrees emerge. Data is driving the fourth industrial revolution, and proper training is necessary to make sure the ‘pollution’ and unintended consequences of processing this new resource are minimized.

ATB’s internal Artificial Intelligence Guild, along with the organization as a whole, is dedicated to fostering a robust, world-class talent pool of data scientists in Alberta: people who are skilled craftspeople when it comes to slicing into data sets and extracting meaningful insights.

In May 2019, the University of Calgary hosted a Research Computing Summer School on behalf of WestGrid to address the growing market need for data science training, and ATB sent data scientists to help develop and present the curriculum.

The intensive four-day session had three streams:

  1. Scientific and Parallel Computing in Linux
  2. Selected Topics in Data Analysis and Code Acceleration
  3. Applied Machine Learning and Data Science Using Python and MATLAB

The summer school was intended for researchers and students from all disciplines (everything from Fine Arts to Physics) to get a taste of what’s required to be successful at applied data science. Participants were able to attend introductory and advanced courses in high performance computing, parallel programming, machine learning, research computation, and scientific data visualization.

Two ATB Data Scientists from the Artificial Intelligence Guild taught a session in the Machine Learning Stream on “Gathering and Using Unstructured Data.” The class was attended by roughly 70 students with backgrounds in about 30 different research areas, including biology, engineering, physics, business, neuroscience, computer science, and medicine, to name only a handful.

Chaos to Order: Teaching a Process for Mining Unstructured Data
One of the primary challenges of moving from pure academia to industry data science is the actual gathering of data itself. It is implied, when taking a standard data science bootcamp, that the employer will deliver a clean, large, structured dataset, which can easily be imported into one’s favorite programming environment, and within minutes the insights from machine learning models are flowing.

In reality, data sources are often unstructured, unclean, and strewn across a variety of disjoint sources such as legacy data warehouses, open data portals, and commercial APIs. Students are rarely shown, all in one sitting, how to gather this data in a real-world setting, consolidate it in one place, and then model it to understand something about the universe. Often only a portion of a fully functioning data science ‘stack’ is taught in each lesson, resulting in uncertainty about how to take on an end-to-end project or challenge.

The first task for the class involved having the students open a cloud computing environment (utilizing Google Cloud Platform’s free trial, in most cases), then build and run a Docker container based on Ubuntu 18.04 with a JupyterLab interface. This created a ready-made Python programming sandbox for students, some of whom had never before used the command line.

From there, the students were able to jump feet-first into the task of building a model using data sourced from an Application Programming Interface (API). Connecting to an API, pulling the data, and cleaning it gave the students a feel for how to interface with an external data store and make sense of the returned data structure.
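To make the connect, pull, and clean steps concrete, here is a minimal sketch using only the Python standard library. It is not the code used in the class: the payload shape (a "results" list) and the field names "title", "rating", and "votes" are assumptions for illustration, and the sample payload holds placeholder values so the cleaning step can run without network access.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """Download a JSON document from an HTTP API and decode it."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def clean_records(payload, fields=("title", "rating", "votes")):
    """Flatten an API payload into uniform rows, dropping incomplete ones.

    The "results" key and the field names are hypothetical; a real API
    will have its own structure.
    """
    rows = []
    for item in payload.get("results", []):
        row = {name: item.get(name) for name in fields}
        if any(value is None for value in row.values()):
            continue  # skip records missing a required field
        row["title"] = str(row["title"]).strip()
        row["rating"] = float(row["rating"])
        # Vote counts sometimes arrive as strings with thousands separators.
        row["votes"] = int(str(row["votes"]).replace(",", ""))
        rows.append(row)
    return rows


# Offline stand-in for what fetch_json(...) might return; placeholder values.
sample_payload = {
    "results": [
        {"title": "  Show A ", "rating": "8.1", "votes": "12,345"},
        {"title": "Show B", "rating": None, "votes": "900"},
        {"title": "Show C", "rating": 7.4, "votes": 2500},
    ]
}
cleaned = clean_records(sample_payload)
```

Keeping the download and the cleaning in separate functions means the cleaning logic can be checked against a saved response before it is pointed at a live endpoint.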

The ‘cartoon example’ was, quite literally, a cartoon example: the students followed along as a custom predictive model was built to try to determine the outcome of a March Madness-style competition of 64 cartoon shows, where pairs of cartoons would ‘battle’ each other to determine a winner based on popularity. The example was meant to answer whether it was possible to predict which of the 64 competing shows would come out on top as the highest-voted (most popular) cartoon.

The actual competition was hosted by a local radio show in Edmonton, Alberta, so the setup of the competition and its outcomes were available to compare against the model results. The data used for the model was sourced through an unofficial IMDb client library and included metrics such as overall IMDb score, number of seasons, number of votes, and age of the show.

Figure 1.1: Actual results for the March Madness of Cartoons competition.

In the end, the students were shown how to go end to end through a data science application, from data collection and cleaning through modelling to insights: in this case, a simulation of expected tournament outcomes for a radio show competition.
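A tournament simulation of this kind can be sketched in a few lines. The version below is an illustration under stated assumptions, not the model built in the class: it assumes each show has already been reduced to a single positive popularity score (how the IMDb metrics were combined is not described here), that the chance of winning a matchup is proportional to the two scores, and that the field size is a power of two. The show names and scores are placeholders.

```python
import random
from collections import Counter


def win_probability(score_a, score_b):
    """Chance that A beats B, proportional to the two popularity scores."""
    return score_a / (score_a + score_b)


def play_bracket(entrants, scores, rng):
    """Run one single-elimination bracket and return the champion.

    Assumes len(entrants) is a power of two; adjacent entrants are paired.
    """
    field = list(entrants)
    while len(field) > 1:
        next_round = []
        for i in range(0, len(field), 2):
            a, b = field[i], field[i + 1]
            a_wins = rng.random() < win_probability(scores[a], scores[b])
            next_round.append(a if a_wins else b)
        field = next_round
    return field[0]


def simulate(entrants, scores, runs=1000, seed=0):
    """Repeat the bracket many times and estimate each show's title odds."""
    rng = random.Random(seed)
    titles = Counter(play_bracket(entrants, scores, rng) for _ in range(runs))
    return {name: count / runs for name, count in titles.items()}


# 64 placeholder shows with made-up popularity scores (not real IMDb data).
shows = [f"show_{i:02d}" for i in range(64)]
scores = {name: 1.0 + i for i, name in enumerate(shows)}
title_odds = simulate(shows, scores, runs=500, seed=42)
favourite = max(title_odds, key=title_odds.get)
```

Running the bracket many times, rather than once, turns a single noisy prediction into an estimate of how often each show would be expected to win, which is the kind of output that can be compared against the actual results.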

The next module involved training in a very active area of research and application: optical character recognition (OCR) and natural language processing (NLP). Before the class began, sentences from three ‘spooky’ authors (Edgar Allan Poe, H. P. Lovecraft, and Mary Shelley), taken from an open Kaggle competition dataset, were converted from text into images of text at varying rotations.

The students then followed along as a full-stack OCR project was undertaken. The process began by using a Python wrapper for Tesseract OCR to clean the images, orient them properly, and extract the text from the images, converting them into strings that could be stored in a database. Using a training dataset (a subset of the data where the sentences and their authors were matched), a machine learning classifier was trained to decipher the nuances associated with each of the three authors based purely on samples of their writing. The final step was testing this machine learning classifier on a “holdout” dataset, to determine how accurate it was in classifying the expected ‘spooky author’ of an excerpt, using only the sample of writing and nothing else!
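The classification and holdout steps can be illustrated with a small, self-contained sketch. The type of classifier used in the class is not stated here, so this example swaps in a simple multinomial Naive Bayes model with add-one smoothing, written in plain Python. The sentences and the labels "author_a" and "author_b" are invented toy data, not excerpts from the Kaggle dataset, and the text is assumed to have already been extracted from the images.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesAuthorClassifier:
    """Multinomial Naive Bayes over word counts, with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods per word.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, texts, labels):
    """Fraction of holdout texts whose predicted author matches the label."""
    hits = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return hits / len(labels)


# Invented toy sentences standing in for labelled training excerpts.
train_texts = [
    "the sea was cold and the ship was lost",
    "waves broke over the ship at night",
    "the mountain path was steep and dry",
    "we climbed the dry mountain at dawn",
]
train_labels = ["author_a", "author_a", "author_b", "author_b"]
holdout_texts = ["the ship sank in the cold sea", "a steep path up the mountain"]
holdout_labels = ["author_a", "author_b"]

model = NaiveBayesAuthorClassifier().fit(train_texts, train_labels)
holdout_accuracy = accuracy(model, holdout_texts, holdout_labels)
```

The holdout sentences never appear in training, so the accuracy figure reflects how well the model generalizes from word usage alone rather than how well it memorized the examples.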

In essence, the students found it was possible to teach a computer to read text off an image, then train the computer to guess the author of a sample of text by only showing it pre-labelled excerpts (in this case labelled by author). This type of process has applications in any industry where scanned documents are common, or where computer vision is required to make sense of stored images. In the finance industry, OCR is often applied to extract text from scanned images of financial statements or customer account documents. Then, text analytics or natural language processing is applied to the text to derive insights about the customer, or extract relevant account information for storage in a flat-file database. 


Figure 1.2: Example of image of text (upside down) and extracted text, plus prediction from the machine learning model (Edgar Allan Poe) and actual author (Edgar Allan Poe).

In addition to these two primary use cases, the students also saw a variety of other end-to-end examples: collecting data from a Bitcoin data API and plotting cryptocurrency price trends, getting data on SpaceX launches and tracking total payload weights each year, and gathering images from open-source datasets and applying computer vision cleaning techniques.
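As one illustration of these smaller examples, the yearly payload totals come down to a group-and-sum over launch records. The sketch below assumes each record carries an ISO 8601 timestamp under "date_utc" and a list of payload masses under "payload_masses_kg". Both field names are hypothetical, and the records are made-up placeholders rather than real launch data.

```python
from collections import defaultdict
from datetime import datetime


def payload_mass_by_year(launches):
    """Sum payload mass per calendar year, skipping unknown (None) masses."""
    totals = defaultdict(float)
    for launch in launches:
        year = datetime.fromisoformat(launch["date_utc"]).year
        for mass in launch["payload_masses_kg"]:
            if mass is not None:
                totals[year] += mass
    return dict(sorted(totals.items()))


# Placeholder records; dates and masses are invented for illustration.
launches = [
    {"date_utc": "2018-01-01T00:00:00+00:00", "payload_masses_kg": [1000.0]},
    {"date_utc": "2018-06-01T00:00:00+00:00", "payload_masses_kg": [2500.0, None]},
    {"date_utc": "2019-01-01T00:00:00+00:00", "payload_masses_kg": [4000.0]},
]
totals = payload_mass_by_year(launches)
```

Skipping missing masses instead of treating them as zero keeps the gap visible if a count of skipped records is added later.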

Figure 1.3: Bitcoin prices (CAD) extracted from an API and plotted for analysis.

On top of the formal curriculum, there was a lot of informal discussion happening during the session and afterwards. ATB Financial’s data scientists were able to share their experience with transitioning from academia to industry, and offer professional and personal advice to aspiring data scientists. While there is the pure enjoyment aspect of volunteering, there is also a very real professional responsibility for current data scientists to reach out and educate the next generation. 

This outreach will produce fresh talent with the skill sets required to hit the ground running, should these students finish their academic journeys and decide to enter industry. Training and outreach are a wise medium-to-long-term investment for companies currently undertaking data science projects. The field of data science is ever-changing and ever-growing, meaning the best way for students to keep current with new tools and techniques is to learn directly from those applying these techniques to solve business problems. We’re even investing in partnerships with universities to fuel the future of AI and ML and push the boundaries of human potential. By participating in the education of the next generation of data scientists, ATB Financial is ensuring that the talent required to prepare Alberta for the future is grown right here at home.
