Hello World

“Model behavior is not determined by architecture, hyperparameters, or optimizer choices. It’s determined by your dataset, nothing else. When you refer to “Lambda”, “ChatGPT”, “Bard”, or “Claude”, it’s not the model weights that you are referring to. It’s the dataset.” jbetker @ openai

Machine learning models are only as good as the data they’re trained on. Our mission is simple: create the best training sets to build the world’s best models.

Everyone agrees that data quality is critical for building state-of-the-art models. But how do you build a great training dataset?

At Hyperparam we believe that it is impossible to do good data science without being intimately familiar with your training data. But where do you even start? Modern LLMs depend on terabytes of unstructured text data. Most data tools cannot handle this scale of data interactively, or they require sampling that shows only a tiny slice of your data.

If you want to build a highly interactive tool for working with data, the browser is the only platform for building modern UIs. The question is: can the browser handle massive text datasets interactively? Yes. By leveraging modern web APIs, and with an obsessive focus on speed and architecture, we are building the world’s most scalable UI for data.

Building a UI for machine learning data is a necessary first step, but it does not solve the problem of separating good data from bad within massive datasets. To find the “needle in a haystack” we use machine learning models to reflect back on their own training set. Everyone evaluates models – we evaluate data.

Combine this new scalable UI with methods for evaluating ML data, and you have a powerful engine for iteratively developing the world’s highest-quality models.