Congratulations! Let’s start with a brief introduction of yourself:
My name is Fedi GHANMI. I am 22 years old and I study at Tunis Business School (TBS), where I majored in Business Analytics. I spent the last three months (June 2022 to August 2022) at BUSINESS & AI as a data science intern.
Q: Can you summarize the competition for readers who are unfamiliar?
A: The challenge was to build the most accurate model for predicting a continuous target variable from a dataset with a variety of data types. Since the dataset was anonymized, no information was available about its origin, the company it represented, or what we were predicting.
Q: What was your background prior to entering this challenge?
A: Prior to this challenge, I had a good understanding of machine learning algorithms and was comfortable using Python’s scientific libraries like scikit-learn and pandas to manipulate dataframes.
Q: What made you decide to enter this competition?
A: I entered this competition to improve on my result in the previous one. I see competitions as a way to streamline the data science learning process. You get the chance to put what you have learned into practice in a competitive environment, which sharpens both your technical skills and your critical thinking.
Q: How did you approach the problem and structure your solution?
A: I began by removing any observations from the training set that had fewer than 1% NaN values. I then dropped variables with more than 60% NaN values, because they would not add much information to the model, which left me with the most informative variables. Next, I used constant imputation to fill the missing values that would appear at prediction time, so that incoming predictions could be handled reliably. All of these transformations were meant to reduce noise in the data, and that was one of the key reasons my prediction error fell. Finally, I put all of these steps into a pipeline with a random forest regressor as the final estimator and fitted it on our data.
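The steps described in this answer can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the competition code: the helper names (`clean_training_frame`, `build_pipeline`), the column names, the synthetic data, the fill value of -1, and the forest size are all hypothetical. The row-removal rule is read here as "drop rows that have gaps in columns with fewer than 1% NaN", which is only one possible interpretation, and only numeric features are handled.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


def clean_training_frame(df, target, col_threshold=0.60, row_threshold=0.01):
    """Drop sparse columns, then drop rows with gaps in nearly complete columns."""
    nan_share = df.drop(columns=[target]).isna().mean()

    # Columns with more than 60% NaN are removed entirely.
    sparse_cols = list(nan_share[nan_share > col_threshold].index)
    df = df.drop(columns=sparse_cols)

    # Assumed reading: where a column has some, but fewer than 1%, NaN values,
    # drop the affected rows instead of imputing them.
    nearly_complete = list(
        nan_share[(nan_share > 0) & (nan_share < row_threshold)].index
    )
    if nearly_complete:
        df = df.dropna(subset=nearly_complete)
    return df


def build_pipeline(fill_value=-1, random_state=0):
    """Constant imputation followed by a random forest regressor."""
    return Pipeline([
        # Fills any NaN seen at fit or predict time with one fixed value.
        ("impute", SimpleImputer(strategy="constant", fill_value=fill_value)),
        ("model", RandomForestRegressor(n_estimators=50, random_state=random_state)),
    ])


# Synthetic stand-in for the anonymized competition data.
rng = np.random.default_rng(0)
n = 500
data = pd.DataFrame({
    "a": rng.normal(size=n),
    "b": rng.normal(size=n),
    "mostly_missing": rng.normal(size=n),
    "some_missing": rng.normal(size=n),
})
data.loc[data.index[:400], "mostly_missing"] = np.nan  # 80% NaN: column dropped
data.loc[data.index[:2], "a"] = np.nan                 # 0.4% NaN: rows dropped
data.loc[data.index[10:60], "some_missing"] = np.nan   # 10% NaN: imputed later
data["target"] = 2 * data["b"] + rng.normal(scale=0.1, size=n)

train = clean_training_frame(data, "target")
pipeline = build_pipeline()
pipeline.fit(train.drop(columns=["target"]), train["target"])

# New rows may contain NaN even in columns that were complete during training.
new_rows = pd.DataFrame({"a": [np.nan], "b": [1.0], "some_missing": [np.nan]})
pred = pipeline.predict(new_rows)
```

Keeping the imputer inside the pipeline means the same fill rule is applied automatically at prediction time, which matches the goal of handling incoming rows with missing values.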
Q: What was your most important insight into the data?
A: My main realization was that, since the data had been anonymized, I should concentrate more on the modeling and less on the business context.
Q: Which tools did you use?
A: For my packages, I used pandas and scikit-learn, and as my development environment, I used PyCharm Community Edition.
Q: How did you spend your time on this competition?
A: Most of the five-hour competition went to testing feature engineering and data preprocessing ideas. After that, I spent about 30 minutes manually tuning the parameters, and the last 30 minutes organizing and refactoring my code so that I could submit it as clean source code.
Q: What have you taken away from this competition?
A: The main thing I gained from this competition was a healthier way of comparing myself with my peers. I used to make meaningless comparisons between myself and them. Now I feel I can learn from my peers and rivals, discuss how we each approach challenges differently, and use that knowledge to improve.
Q: Do you have any advice for those just getting started in data science?
A: My advice to beginning data scientists is to focus on learning one concept at a time. Being confronted with a ton of new information can be overwhelming for a beginner, so it helps a lot to structure what needs to be learned and to apply it frequently.