For my second independent project, I chose to create a model predicting the lab of origin for engineered DNA, using a dataset provided by DrivenData.org as part of a model-creating contest. I chose this dataset partly because I wanted to challenge myself - simply wrangling the dataset into a usable format was beyond my comprehension when I first started - but mostly because I already had inherent interest in the field of biosecurity and was excited to figure out how I might one day apply my beginner data scientist skills to a challenging problem that actually matters quite a lot.

You see, the technology that has allowed humans to engineer DNA and thereby solve problems has also given us the ability to create new problems, most of which we are unprepared to handle. Part of crafting solutions to the problem of harmful DNA engineering is the attribution problem; or being able to identify the source of engineered DNA if it’s discovered in the wild. As DrivenData puts it, “Reducing anonymity can encourage more responsible behavior within scientific and entrepreneurial communities, without stifling innovation.”

I feel grateful to AltLabs and their partners for sponsoring this contest and its bounty, and a personal thanks to Christine Chung for writing the guide to getting started, which was indispensable to my ability to tackle this project. I had picked out a back up plan for this unit’s project, predicting spread of dengue fever in Latin America, but did not have to change plans only because of Ms. Chung’s writing.

Ultimately, my algorithm could correctly assign a DNA sequence to the ten likeliest labs of origin (of which there were 1314 total) 86% of the time on the public test data and 82% of the time on the private test data. This is a huge improvement over the benchmark top ten accuracy of 33%, and allowed me to place in the top 10% of all the competitors, which I think is remarkable given that I had only been studying data science for 7 weeks and had first generated a virtual environment only a few days before beginning this project. Furthermore, due to my model performing better than the BLAST benchmark accuracy of 75.6%, I was invited to participate in the Innovation Track.

Click here to read the rest of this post at the EA Forum.