Because the data set is multi-dimensional, we apply t-SNE as a preprocessing step to get a visual understanding of the data before actually using it.
The t-SNE algorithm takes two inputs: the data itself and a perplexity value. Perplexity can greatly change the resulting visualization. A higher perplexity means that the algorithm treats more high-dimensional data points as neighbors of each point, while a lower perplexity restricts each point to fewer, closer neighbors.
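To make the role of perplexity concrete, the following is a minimal sketch of how t-SNE relates perplexity to the Gaussian bandwidth around a single point. It is not our actual pipeline: the function names and the list of squared distances are hypothetical and chosen only for illustration. For one point, the neighbor probabilities are Gaussian weights normalized to sum to 1, perplexity is 2 raised to the Shannon entropy of that distribution, and the bandwidth sigma is searched until the requested perplexity is reached.

```python
import math


def conditional_probs(sq_dists, sigma):
    """Gaussian neighbor probabilities p_{j|i} for one point i.

    sq_dists holds the squared distances from point i to every other point.
    """
    d_min = min(sq_dists)  # shift for numerical stability; cancels after normalization
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]


def perplexity(probs):
    """Perplexity = 2 ** (Shannon entropy in bits) of a probability distribution."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2.0 ** entropy


def sigma_for_perplexity(sq_dists, target, lo=1e-3, hi=1e3, iters=100):
    """Binary-search the bandwidth sigma whose distribution has the target perplexity."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint of the search interval
        if perplexity(conditional_probs(sq_dists, mid)) < target:
            lo = mid  # distribution too peaked: widen the Gaussian
        else:
            hi = mid  # distribution too flat: narrow the Gaussian
    return math.sqrt(lo * hi)


# Hypothetical squared distances from one point to 50 other points (illustration only).
sq_dists = [float(k * k) for k in range(1, 51)]

for target in (2, 5, 30):
    sigma = sigma_for_perplexity(sq_dists, target)
    probs = conditional_probs(sq_dists, sigma)
    print(f"perplexity={target}: sigma={sigma:.3f}, "
          f"largest neighbor probability={max(probs):.3f}")
```

In this sketch a larger target perplexity forces a wider Gaussian, so the probability mass spreads over more points. This is the sense in which a higher perplexity treats more points as neighbors.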
[Figures: t-SNE result with perplexity=2; result with perplexity=5; result with perplexity=30]
Conclusion: The results above show that categories 0-4 form visible clusters, but the data points in category 5 (Used in Last Week) and category 6 (Used in Last Day) are difficult to separate.
Research question: It has been observed that alcohol use disorder is frequently related to personality. Intuitively, the relationship could run in both directions: alcohol use disorder may cause personality change, while personality could also lead to alcohol use disorder. We suppose that the relationship between personality and alcohol misuse is very complicated, and that the two may even reinforce each other and worsen the outcome. Here, we start by investigating how personality may lead to alcohol use disorder.
Problem formulation: We will use a machine learning model to analyze how a person's personality may lead to alcohol misuse.
Data representation: Text/Binary/Numeric?
Preprocessing/Visualization: Map the data into 2D to see whether the categories are potentially separable
White-box models (Logistic Regression, Support Vector Machine, Decision Tree, Random Forest): more interpretable, but normally lower performance when a lot of data is available
Black-box models (Deep neural networks): less interpretable, but normally higher performance when a lot of data is available
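As an illustration of what "more interpretable" means for a white-box model, the following is a minimal sketch of logistic regression trained by plain gradient descent. It is only a sketch under stated assumptions: the data are synthetic and generated in the code, the two features are hypothetical placeholders rather than columns of our data set, and the learning rate and epoch count are arbitrary choices. The point is that each learned weight can be read directly as the direction and strength of one feature's influence on the prediction.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * d
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this sample: predicted probability minus label.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Synthetic data for illustration only (not from our data set):
# hypothetical feature 0 drives the label, hypothetical feature 1 is pure noise.
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [1 if x[0] + 0.3 * rng.gauss(0, 1) > 0 else 0 for x in X]

w, b = fit_logistic(X, y)
for name, weight in zip(("feature_0", "feature_1"), w):
    print(f"{name}: weight = {weight:.3f}")
```

In this sketch the weight of the informative feature should come out clearly larger in magnitude than the weight of the noise feature. A deep neural network offers no comparably direct reading of its parameters.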