Identifying Fraud from Enron Data

Objective

This project uses machine learning to attempt to identify persons of interest from the Enron financial scandal. Persons of Interest (POIs) were defined at the onset by Udacity using an article from USA Today. Two data resources were provided by Udacity for identifying POIs. The first is the Enron Email Dataset maintained by Carnegie Mellon University; it contains a corpus of close to 500,000 email messages from Enron's servers. The second is a dataset created by Udacity that collects useful information for the task, including financial features as well as aggregate information about the email dataset.

Machine learning excels at this type of task because it can process large amounts of information in a way that would be time consuming to do by hand. Machine learning algorithms use statistical methods to learn from data and can then make inferences based on the relationships they have found. Although developing the algorithms takes considerable care and effort, once completed they can process, within seconds, datasets that would take a human days to review.

Data Exploration

The data was loaded into pandas dataframes to simplify exploration and visualization. There were 146 entries and 21 features. 18 (12%) of the entries were identified as persons of interest.
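As a point of reference, here is a minimal sketch of the loading step. It assumes the standard final_project_dataset.pkl pickle from the Udacity starter code, a dict keyed by person name in which missing values are stored as the string 'NaN'.

```python
import pickle

import numpy as np
import pandas as pd

# Load the Udacity-provided data dict (file name from the standard starter code).
with open("final_project_dataset.pkl", "rb") as f:
    data_dict = pickle.load(f)

# One row per person, one column per feature.
df = pd.DataFrame.from_dict(data_dict, orient="index")

# Missing values are stored as the string 'NaN'; convert them to real NaNs.
df = df.replace("NaN", np.nan)

print(df.shape)         # expected: (146, 21)
print(df["poi"].sum())  # expected: 18 persons of interest
```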

The data was created from two sources. The financial information came from a document attributed to Findlaw.com, whereas the email information was compiled by aggregating information from the Enron email corpus. For this reason it made sense to divide the data along those lines when looking for trends and similarities between the features.

Outlier Investigation

To search for outliers I screened for any values with a z-score above 3, and I visualized the results to get a feel for the distributions. The largest outliers were created by the presence of a 'TOTAL' entry in the financial data. This entry prompted me to search through the Findlaw.com document for other unusual items, where I also found a travel agency entry that needed to be removed.
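A sketch of that screening step, continuing from the DataFrame above. The split into financial and email columns and the exact name of the travel agency entry are assumptions based on the standard dataset.

```python
# Email aggregates compiled from the corpus; everything else (minus 'poi' and
# 'email_address') is treated as a financial feature.
email_cols = ["to_messages", "from_messages", "from_poi_to_this_person",
              "from_this_person_to_poi", "shared_receipt_with_poi"]
financial_cols = [c for c in df.columns
                  if c not in email_cols + ["poi", "email_address"]]

# Flag any value more than 3 standard deviations from its column mean.
values = df[financial_cols].astype(float)
zscores = (values - values.mean()) / values.std()
print(sorted(df.index[(zscores.abs() > 3).any(axis=1)]))

# 'TOTAL' is a spreadsheet artifact and the travel agency entry is not a person;
# both were dropped (entry names assumed from the Findlaw document).
df = df.drop(["TOTAL", "THE TRAVEL AGENCY IN THE PARK"], errors="ignore")
```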

Plotting the data showed that most of the financial features were positively skewed. This made sense, considering the presence of executives and directors in the data: their compensation was far beyond that of the rest of the employees, and they also received unique types of compensation such as restricted stock and director fees. Because these outliers did not represent errors, they were kept.
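The skew is easy to confirm numerically as well as visually; a quick check along these lines, continuing from the sketches above:

```python
import matplotlib.pyplot as plt

# Large positive skew values confirm the long right tails created by executive pay.
print(df[financial_cols].astype(float).skew().sort_values(ascending=False))

# Histograms for visual inspection of each financial feature.
df[financial_cols].astype(float).hist(bins=30, figsize=(14, 10))
plt.tight_layout()
plt.show()
```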

The email data also showed some outliers, which was expected. Individuals who regularly contacted POIs should stand out from those who didn't, because POIs only make up a small portion of the email corpus. However, the persons identified as outliers on non-POI related fields represent a risk of data leakage. Some searching confirmed that these individuals held high-level positions at Enron, and it is likely they had a higher presence in the email corpus because they were the focus of investigations. This created a potential problem where the focus of the investigation could leak in as a factor in identifying POIs, which I planned to watch for during feature selection.

Feature Selection

Creating New Features

One theory I had was that incidents like the Enron scandal are created by systemic problems within companies. For example, in the case of Enron employee compensation could have been structured in a way that rewarded bad actors.

I created two new features to search for compensation schemes that might have been used as performance incentives, the idea being that if these incentives rewarded the most competitive employees, the 'cheaters' might have risen to the top.

  1. bonus_ratio: bonus / salary
    • How large was the bonus in relation to salary?
  2. stock_ratio: total_stock_value / total_payments
    • Did any employees receive large amounts of stock in comparison with more standard compensation payments?

I also created two email ratios to try to prevent the potential data leakage I identified during the outlier investigation. These features normalize the email interactions with POIs so that they are less influenced by the total number of emails for each person in the dataset; a sketch of all four derived features follows the list below.

  1. from_this_person_to_poi_ratio: from_this_person_to_poi / from_messages
  2. from_poi_to_this_person_ratio: from_poi_to_this_person / to_messages
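A sketch of all four derived features, continuing from the DataFrame above. Guarding against division by zero and the exact handling of missing values are assumptions for illustration.

```python
num = df[["bonus", "salary", "total_stock_value", "total_payments",
          "from_this_person_to_poi", "from_poi_to_this_person",
          "from_messages", "to_messages"]].astype(float)

# Compensation ratios.
df["bonus_ratio"] = num["bonus"] / num["salary"]
df["stock_ratio"] = num["total_stock_value"] / num["total_payments"]

# Email ratios, normalized by each person's own message volume to limit the
# leakage risk noted during the outlier investigation.
df["from_this_person_to_poi_ratio"] = num["from_this_person_to_poi"] / num["from_messages"]
df["from_poi_to_this_person_ratio"] = num["from_poi_to_this_person"] / num["to_messages"]

# Division by zero produces inf; treat those values as missing.
df = df.replace([np.inf, -np.inf], np.nan)
```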

After creating these variables there were 23 features in the dataset, not counting 'poi' and 'email_address'. Email addresses were removed as they had no value for the approach I chose.

Dimensionality Reduction

Dimensionality reduction was an important part of the process because of the relatively small number of entries in the dataset: using too many features with a small dataset increases the chance of overfitting. In addition, the number of financial features with sparse data made cross validation difficult. Data is considered sparse when a high percentage of its values are zero. This can cause a cross validator to return drastically different results depending on where the division between training and validation sets is made, and can consequently lead to overfitting.

To reduce these problems I used sklearn's f_classif function to identify the features with the lowest p-values, and I also examined features with high percentages of sparse data. After inspection I decided to drop all of the features with more than 60% sparse data and p-values over 0.05. As a last manual step I removed two features that had a high correlation (>0.9) with other features in the dataset: total_stock_value and total_payments. Removing these was a natural choice, since they are composites of other features and do not add information.
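A sketch of this screening step, under one reading of the filter (a feature is dropped if it fails either criterion). Filling missing values with zero before scoring is an assumption that mirrors the convention of the Udacity feature-formatting helper.

```python
from sklearn.feature_selection import f_classif

labels = df["poi"].astype(int)
features = df.drop(columns=["poi", "email_address"])
X = features.fillna(0).astype(float)

# ANOVA F-values and p-values for each feature against the POI label.
F, p = f_classif(X, labels)
scores = pd.DataFrame({"F": F, "p": p, "pct_zero": (X == 0).mean()},
                      index=features.columns)

# Keep features that are significant and not dominated by zeros, then drop the
# two composite totals flagged by the correlation check.
keep = scores[(scores["p"] <= 0.05) & (scores["pct_zero"] <= 0.60)].index
keep = keep.drop(["total_stock_value", "total_payments"], errors="ignore")
print(scores.loc[keep].sort_values("F", ascending=False))
```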

With this filtered list I was able to perform a grid search to select the features I wanted to use. The grid search used a pipeline to perform an exhaustive search of all the features against several algorithms. The first step of the pipeline scaled the data so that it was appropriate for all of the algorithms; sklearn's MinMaxScaler was used for this step. Features were then selected with SelectKBest, which picks the best k features based on their ANOVA F-values. A total of 6 features made the final selection. Their F-values are in the table below.

Feature                         F-value
exercised_stock_options         25.097542
bonus                           21.060002
salary                          18.575703
from_this_person_to_poi_ratio   16.641707
bonus_ratio                     10.955627
long_term_incentive             10.072455

Algorithm Selection

Four algorithms were tested as part of the grid search performed for feature selection: LinearSVC, KNeighborsClassifier, DecisionTreeClassifier, and AdaBoostClassifier. The grid search identified the best classifier based on the maximum f1_score obtained across all of the tests.

Algorithm                  f1_score
LinearSVC                  0.21133
KNeighborsClassifier       0.302
DecisionTreeClassifier     0.37257
AdaBoostClassifier         0.33505

DecisionTreeClassifier scored the highest in the grid search.
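A sketch of the combined feature and algorithm search described above, continuing from the filtered list: a MinMaxScaler → SelectKBest → classifier pipeline in which the classifier step is swapped among the four candidates and k is searched over the filtered features. The number of cross-validation splits, the test fraction, and the random seeds are assumptions rather than values reported in the project.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", DecisionTreeClassifier()),  # placeholder; replaced by the grid below
])

# Exhaustive search over k for each candidate algorithm.
k_options = list(range(1, len(keep) + 1))
param_grid = [
    {"select__k": k_options, "clf": [LinearSVC(max_iter=10000)]},
    {"select__k": k_options, "clf": [KNeighborsClassifier()]},
    {"select__k": k_options, "clf": [DecisionTreeClassifier(random_state=42)]},
    {"select__k": k_options, "clf": [AdaBoostClassifier(random_state=42)]},
]

cv = StratifiedShuffleSplit(n_splits=100, test_size=0.3, random_state=42)
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=cv, n_jobs=-1)
search.fit(X[keep], labels)

print(search.best_params_)

# F-values of the features retained by the winning pipeline.
selector = search.best_estimator_.named_steps["select"]
print(pd.Series(selector.scores_, index=keep)[selector.get_support()])
```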

Algorithm Tuning

Algorithms often have a variety of parameters that can be tuned to optimize performance. Proper tuning is important in balancing the bias-variance tradeoff of the algorithm. If the algorithm is tuned too aggressively it will perform well on the training set but poorly on new data. Conversely, if the algorithm is tuned too conservatively it will perform equally on training and test data, but will not achieve all of its potential accuracy.

To tune the DecisionTreeClassifier I created a second grid search to test the following parameters and options (a sketch of this search follows the list):

  • class_weight:
    • None, balanced
  • criterion:
    • gini, entropy
  • splitter:
    • best, random
  • max_depth:
    • None, 7, 6, 5, 4
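A sketch of this second search. X_selected is an assumed name for the six features chosen earlier, and the split counts and seeds are again assumptions.

```python
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

tune_grid = {
    "class_weight": [None, "balanced"],
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": [None, 7, 6, 5, 4],
}

cv = StratifiedShuffleSplit(n_splits=100, test_size=0.3, random_state=42)
tuner = GridSearchCV(DecisionTreeClassifier(random_state=42), tune_grid,
                     scoring="f1", cv=cv, n_jobs=-1)
tuner.fit(X_selected, labels)  # X_selected: the six selected features (assumed name)
print(tuner.best_params_, tuner.best_score_)
```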

The best score was obtained with class_weight='balanced', criterion='entropy', splitter='best', and max_depth=4, which raised the f1 score to 0.52538.
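The final classifier can then be rebuilt directly from those settings (the random_state below is an assumption added for reproducibility, not a reported value):

```python
# Final model with the best settings reported above.
final_clf = DecisionTreeClassifier(class_weight="balanced", criterion="entropy",
                                   splitter="best", max_depth=4, random_state=42)
```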

Validation

Validation is the process of testing algorithm performance and parameter tuning. A classic mistake is to test an algorithm's performance on the same data it was trained and optimized with; this usually results in very high test scores but poor performance on unknown data. Choosing a robust validation strategy is necessary to ensure that a machine learning algorithm will be able to make predictions when new data is introduced.

I used the same StratifiedShuffleSplit cross-validation tool as the tester.py script provided by Udacity. It lends itself well to this project because of the small size of the dataset: it creates random splits of test and train data while maintaining the ratio of classes, so that each split is representative of the whole dataset. The random splitting means that data can be reused between different splits, but it also allows many more train/test combinations than would be possible if reuse were not allowed. I implemented the approach from within the GridSearchCV cross-validators I used for model selection and tuning.
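A minimal sketch of the splitter itself, continuing from the earlier sketches. The 1,000-fold count mirrors tester.py, while the test fraction and seed here are assumptions.

```python
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.1, random_state=42)

# Each split preserves roughly the ~12% POI ratio of the full dataset.
for train_idx, test_idx in sss.split(X[keep], labels):
    print(labels.iloc[test_idx].mean())
    break
```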

Metrics

I used the f1 score as the performance objective for model selection and tuning. It is a better tool than accuracy because of the imbalanced nature of the data. The f1 score is the harmonic mean of precision and recall, which are also the grading metrics of the project. Precision measures how often the model is correct when it predicts a class, whereas recall measures the ability of the model to find the instances of a class when they occur. A model can have high precision for a class by being conservative in how it predicts that class, while at the same time being penalized on recall for not predicting the class often enough when it occurs. The f1 score balances this tradeoff.

The tester.py script returned the following metrics for the final algorithm:

  • Accuracy: 0.765
  • Precision: 0.34135
  • Recall: 0.5675
  • F1 score: 0.42629
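As a sanity check, the reported F1 score follows directly from the reported precision and recall via the harmonic mean:

```python
precision, recall = 0.34135, 0.5675

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 5))  # 0.42629, matching the tester.py output above
```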