My name is Mohamed Fathallah. I am a third-year engineering student at ESPRIT, majoring in Artificial Intelligence. I look forward to becoming an important asset to the AI industry.
What was the competition about?
The BUSINESS & AI 2nd Online ML Competition was a Machine Learning challenge in which participants had to submit the most accurate predictions to win. It was a regression problem where the goal was to minimize the Root Mean Squared Error (RMSE) of predictions on a dataset containing numerical, categorical and textual data. The competition took place online over a period of 5 hours. The challenge was open to anybody interested in Machine Learning and Artificial Intelligence, with no prior experience required.
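For readers new to the metric, here is a minimal sketch of how RMSE is computed. The function name `rmse` and the sample numbers are made up for illustration; they are not the competition's scoring code or data.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Illustrative values only: the errors are 1, 0 and -2,
# so the mean squared error is (1 + 0 + 4) / 3.
example_score = rmse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
print(example_score)
```

A lower value means the predictions are closer to the targets, and a perfect model scores 0.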
The background prior to entering the challenge
I am a college student majoring in Computer Science Engineering, and more precisely in Artificial Intelligence. Before this year, when I chose my major, I did not have experience in the field of AI. I was always curious about new technologies, which naturally attracted me to this field. AI is a branch that is constantly growing and gaining impact on the world. In a few years, everything will be AI-powered, from banking to grocery shopping. It is a powerful tool that helps people constantly improve their daily lives.
For me, being part of this industry meant being at the cutting edge of technology. With that motivation, I became more and more involved in this field, first by teaching myself through online courses and specializations, and later through my college education. As a result, before entering this challenge, I had extensive knowledge of Machine Learning algorithms and was comfortable coding in Python and its libraries.
What made me decide to enter the competition
I am always on the lookout for new opportunities to grow and learn, so when I came across the competition on LinkedIn, I found it extremely interesting. For me, competitions are the perfect way to challenge myself and to build my knowledge. I enjoyed it and gained more experience from it.
Approaching the problem and structuring the solution
First, I explored what the data consisted of: numerical, categorical, and textual variables.
For the cleaning phase, I removed all the textual data because I judged that it would be too time-consuming to process given the short time we were allocated. Then, I inspected the values and dropped the variables that carried little information, such as columns with a constant value across all rows or numerical columns with very little variance. I also dropped columns that had more than 70% missing values.
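The cleaning rules above can be sketched with Pandas roughly as follows. This is an illustration, not the code used in the competition. The helper name `clean_features`, the `text_cols` argument, the toy column names and the `min_variance` threshold are all assumptions; only the 70% missing-value cut-off comes from the description.

```python
import numpy as np
import pandas as pd


def clean_features(df, text_cols, max_missing=0.70, min_variance=1e-3):
    """Drop text, mostly-missing, constant and near-constant columns.

    `min_variance` is an assumed threshold for "very little variance".
    """
    # 1. Remove the free-text columns entirely.
    out = df.drop(columns=list(text_cols))

    # 2. Remove columns with more than `max_missing` share of missing values.
    missing_share = out.isna().mean()
    out = out.drop(columns=missing_share[missing_share > max_missing].index)

    # 3. Remove columns holding a single distinct value (missing values ignored).
    n_unique = out.nunique(dropna=True)
    out = out.drop(columns=n_unique[n_unique <= 1].index)

    # 4. Remove numeric columns whose variance is below the threshold.
    variances = out.select_dtypes(include="number").var()
    return out.drop(columns=variances[variances < min_variance].index)


# Toy data, invented for this example.
demo = pd.DataFrame({
    "description": ["red car", "blue car", "old van", "new van"],
    "price_band": ["low", "high", "low", "high"],
    "size": [1.0, 2.5, 3.0, 4.2],
    "constant": [7, 7, 7, 7],
    "almost_flat": [1.0, 1.0, 1.0, 1.001],
    "mostly_missing": [np.nan, np.nan, np.nan, 3.0],
})

cleaned = clean_features(demo, text_cols=["description"])
print(list(cleaned.columns))
```

On real data the variance threshold would need tuning, because variance depends on the scale of each column.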
To fill columns with some missing values, I used an iterative imputer, which is a multivariate imputation strategy. At each step, one feature column is designated as the output y, while the other feature columns are treated as inputs X. A regressor is fitted on (X, y) for the rows where y is known, and is then used to predict the missing y values. This is done for each feature in turn and repeated for max_iter imputation rounds. The results of the final imputation round are then returned.
For the modelling phase, I tried different models with their default parameters in order to choose the most accurate one. Among the models were linear regression, a KNeighbors regressor, Random Forest, and a Gradient Boosting regressor. Then I used GridSearchCV (a technique that searches a given grid of parameter values for the best combination) to optimize the chosen model.
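The imputation and model-selection steps described above could look roughly like this with scikit-learn. It is a sketch under several assumptions, not the competition code. The data is synthetic, the imputer is assumed to be scikit-learn's `IterativeImputer`, and the source does not say which model was chosen, so Gradient Boosting is tuned here purely as an example with a made-up parameter grid. Placing the imputer inside a pipeline is also a design choice of this sketch.

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline

# Synthetic regression data with about 10% of the entries removed.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.10] = np.nan

# Imputation on its own: every missing entry is replaced by a prediction.
X_filled = IterativeImputer(max_iter=10, random_state=0).fit_transform(X_missing)

# Compare candidate models with default parameters, scored by RMSE.
candidates = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(),
    "random_forest": RandomForestRegressor(random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}
scores = {}
for name, model in candidates.items():
    pipe = make_pipeline(IterativeImputer(max_iter=10, random_state=0), model)
    neg_rmse = cross_val_score(
        pipe, X_missing, y, cv=3, scoring="neg_root_mean_squared_error"
    )
    scores[name] = -neg_rmse.mean()  # flip the sign back to a plain RMSE
print(scores)

# Tune one model with a small, illustrative grid (assumed values).
pipe = make_pipeline(
    IterativeImputer(max_iter=10, random_state=0),
    GradientBoostingRegressor(random_state=0),
)
param_grid = {
    "gradientboostingregressor__n_estimators": [50, 100],
    "gradientboostingregressor__max_depth": [2, 3],
}
search = GridSearchCV(
    pipe, param_grid, cv=3, scoring="neg_root_mean_squared_error"
)
search.fit(X_missing, y)
print(search.best_params_, -search.best_score_)
```

Keeping the imputer inside the pipeline means it is refitted on each training fold, so the cross-validated RMSE is not influenced by values from the held-out fold.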
Insights on the data and the tools used
The provided data was anonymized, which makes it hard to interpret the relationships between the variables. Unlike in traditional ML competitions, where the data has context, I did not try to understand the meaning of the columns but instead treated them simply as numbers. As a result, I focused more on the modelling part.
For my development environment, I used Visual Studio Code. As for the libraries, I used Pandas for the data manipulation and Scikit-learn for the modelling part.
Competition Takeaway
Time management is an essential skill for succeeding in these competitions. From challenge to challenge, I have slowly but surely improved my ability to judge whether something is worth trying or not. Since BUSINESS & AI’s competition was set in a relatively short time frame, I was able to challenge myself further and improve more. I am grateful for the chance to participate and for the opportunity to train.
Advice to get started in data science
The best advice I can give beginners is not to be afraid to try and make mistakes. Take every opportunity you can find to learn, whether it is a challenge or a hackathon. Participate as much as possible without thinking that it is beyond your level. Do not hesitate to reach out to others in the field for help and to get involved as much as possible.