Data Science Institute

Data Science Fellows help create language learning tool

Data and data science techniques are increasingly relevant in a wide range of disciplines.

Photos: Samuel Murk Caya (left) and Christopher Sarli (right)

More and more, faculty want to incorporate data science into their teaching, whether for the first time or in new ways. However, faculty in disciplines that are new to data science often lack the training or expertise to make effective use of the available data and methodologies. Brown’s Data Science Fellows program addresses this gap by pairing faculty members with Brown undergraduates who bring data science skills and experience to the project. A project from the first cohort of Data Science Fellows illustrates how this works.

The inaugural class of Data Science Fellows in Spring 2020 paired eleven students with ten faculty members to tackle projects in fields ranging from language studies to environmental science to the theater arts. In one of these projects, students Samuel Murk Caya (Class of 2022.5, left photo) and Christopher Sarli (Class of 2022, right photo) worked with Dr. Jane Sokolosky, Director of the Center for Language Studies, who wanted to automate the identification of reading assignments that correspond with her instructional goals.

Samuel is a concentrator in Computer Science and Economics and a student in beginning German, and Christopher is a concentrator in Urban Studies and Computer Science; together, they brought the technical expertise needed to build what Dr. Sokolosky had in mind. Over the course of the semester, the trio crafted and curated an online catalog of nearly eighteen hundred digitized German texts that could be sifted according to metadata and grammatical constructions. The catalog now serves as a teaching and learning tool for the German Department on campus. 

An accompanying figure shows an example of the code that identifies grammatical constructions for a search over the text catalog. Once a user has selected search criteria, the digital catalog returns the texts in the database in which those criteria occur most frequently. Each result includes a selection of contextualized matches, allowing the user to quickly gauge the usefulness of each returned text. Major milestones in the project included converting the e-book files from Project Gutenberg into a format workable in the Python programming language, finalizing the German grammar constructions of interest, and deploying the catalog on GitHub. Another accompanying figure shows an example of the possible ‘matches’ for the curricular content.
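The project's own code is not reproduced in this article, but the search behavior described in this paragraph can be sketched in Python. This is only a minimal illustration under stated assumptions: the regular expressions are crude, hypothetical stand-ins for grammatical constructions, "frequency" is taken to mean matches per word, and the function name `search_catalog`, its parameters, and the sample texts are invented for the example.

```python
import re

# Hypothetical patterns: crude regex stand-ins for grammatical constructions.
# The project's real detection rules are not described in the article.
CONSTRUCTIONS = {
    "dass_clause": re.compile(r"\bdass\b", re.IGNORECASE),
    "zu_infinitive": re.compile(r"\bzu\s+\w+en\b", re.IGNORECASE),
}


def search_catalog(catalog, selected, max_snippets=2, context=20):
    """Rank texts by how often the selected constructions occur per word.

    Assumption: "highest frequency" means matches divided by word count.
    """
    results = []
    for title, text in catalog.items():
        word_count = len(text.split())
        if word_count == 0:
            continue
        # Collect every match of every selected construction in this text.
        matches = []
        for name in selected:
            matches.extend(CONSTRUCTIONS[name].finditer(text))
        if not matches:
            continue
        matches.sort(key=lambda m: m.start())
        # Keep a few matches with surrounding characters as context.
        snippets = [
            text[max(0, m.start() - context): m.end() + context].strip()
            for m in matches[:max_snippets]
        ]
        results.append({
            "title": title,
            "count": len(matches),
            "frequency": len(matches) / word_count,
            "snippets": snippets,
        })
    # Texts with the highest match frequency come first.
    results.sort(key=lambda r: r["frequency"], reverse=True)
    return results


# Invented sample texts, standing in for the digitized catalog.
catalog = {
    "Text A": "Ich glaube, dass er kommt. Sie sagt, dass es regnet.",
    "Text B": "Er versucht zu lernen. Das Wetter ist heute schön und warm.",
    "Text C": "Das ist ein Haus.",
}

results = search_catalog(catalog, ["dass_clause"])
for r in results:
    print(r["title"], r["count"], r["snippets"])
```

Normalizing by word count keeps long texts from dominating the ranking simply because they are long. Ranking by raw match counts would be an equally plausible reading of the description.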

“My role in the team was unique,” said Samuel. “Dr. Sokolosky had expert knowledge of the German language but was not familiar with programming. Chris had a deep and broad understanding of programming languages, but he had never studied German before. In the beginning, I was a translator of sorts, since I studied both German and computer science.” In their weekly meetings, the three readily established common ground and developed a shared language, drawn from their different academic backgrounds, for discussing the project. Their discussions ranged from how the project could benefit the German Department on campus to the intuitiveness of the catalog's user interface and how to strike a balance between false positives and false negatives in search results.


Midway through the semester, the onset of the pandemic moved their work online. Thankfully, the project was digital in nature, so their work was largely uninterrupted during the transition to remote collaboration. The only noticeable change was the shift from in-person meetings in the SciLi to virtual meetings on Zoom. Work continued, and the bulk of the project was finished by the end of the semester, with some loose ends tied up over the summer. Samuel, now a teaching assistant for the current cohort of Data Science Fellows, added, “I really cannot stop gushing about this class to people even now, and I am thrilled to see what projects these new Fellows create this semester!”
