1.1 Initiative Summary
Businesses looking to operationalize LLM-supported applications will benefit from cloud-based (private or public) LLM “as a service” (LLMaaS) platforms, which provide governance and scalability. Among the many features of these platforms, including the one from Blattner Technologies, data governance (primarily for unstructured text) will be a critical offering. This initiative will contribute to the development of an extensible end-to-end data governance framework covering external data ingestion, parallelized data preparation and analytics, and versioning.
1.2 Desired Outcomes
- Prototype innovative workflow-based capabilities for preparing unstructured text in a scalable, traceable, and intuitive manner for downstream LLM-related tasks, such as training and fine-tuning.
- Present the approach, challenges, solutions, and significant insights stemming from the effort to the broader company.
1.3 Core Skills Required
- Required skills:
o Fundamental LLM knowledge (e.g., prompt engineering, fine-tuning)
o NLP-based development (e.g., tokenization, embedding generation and operations, textification)
o Python development
o Experience with parallel distributed systems and/or parallel computation libraries such as Spark, Dask, or RAPIDS
- Optional/preferred skills:
o Kubeflow
o Vector databases
o Experience with NLP libraries such as spaCy and gensim
1.4 Estimated Effort
- Full-time summer internship (40 hours/week)
- Depending on progress, work may extend to part-time during the Fall semester (e.g., 10 hours/week)
1.5 Additional Information
This is a remote internship opportunity. You will work with summer mentors and report to the Chief Product Officer of BOSS AI. The group has a deep focus on implementing LLMs “as a service” (LLMaaS), and team members bring a range of skills spanning enterprise software engineering, NLP, ML, and UX. You can expect to gain valuable experience in operationalizing LLMs and addressing critical security needs for language models.