User-driven Data and Generative Models
@ CMU-SoA 48-770 Learning Matters
During the Spring semester of 2021, students of 48-770: Learning Matters participated in a two-week toolmaking practice that was integrated with one of the technical modules of the course. They used various media to collect their own handwriting samples and trained a machine learning model to generate a bespoke handwriting typeface. Instead of the standard practice of using off-the-shelf datasets and purely quantitative metrics to train and evaluate their models, participants relied on data they generated themselves, subjective measures, and personal preferences to make their tools.
The study serves as a pilot to investigate the potential of bespoke data collection methods, interactive data curation tools, and conditional generative models (a Conditional Variational AutoEncoder, or C-VAE, in this case) in machine learning-based creative toolmaking. The scope of the study was intentionally narrowed to a subset of the question that this research addresses. Notably, instead of being selected from a pool of expert users with close to no machine learning experience, participants of this study had prior exposure to programming and had recently been introduced to fundamental concepts and methods in machine learning.
The data pipeline was designed around a unified workflow that all participants could execute and reproduce using a series of simple and accessible tools, such as a printer and a cell-phone camera. It was also flexible enough to accommodate more advanced means, e.g., tablets, styluses, and digitizers.
Participants were provided with an 11-page set of grids with 36 cells for each upper- and lower-case letter of the English alphabet, for a total of 1872 data entry slots (52 glyphs × 36 cells; Figure 4, left). Participants filled out the charts using a writing tool of their choice, such as a pen, pencil, or marker (Figure 4, right), and digitized them using either a scanner or a digital camera. To facilitate this process, participants had access to a data pre-processing CoLab notebook with the necessary functions, as well as a detailed video demonstration. Alternatively, participants could use a digital medium with stylus support to generate the data directly in digital format. In both cases, the same CoLab notebook could be used to slice the input images, extract each letter, remove the cell boundaries, and format the results as NumPy arrays with a pre-defined shape and dimensions.
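The slicing step described above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the function names `slice_grid` and `resize_nearest`, the 6 × 6 layout, the 2-pixel margin, and the 28 × 28 output size are all assumptions made for the example.

```python
import numpy as np

def resize_nearest(img, size):
    """Nearest-neighbour resize to (size, size) using index arithmetic only."""
    ys = np.arange(size) * img.shape[0] // size
    xs = np.arange(size) * img.shape[1] // size
    return img[np.ix_(ys, xs)]

def slice_grid(page, rows, cols, margin=2, out_size=28):
    """Split a grayscale page into rows * cols cells, returned as one array."""
    height, width = page.shape
    cell_h, cell_w = height // rows, width // cols
    cells = []
    for r in range(rows):
        for c in range(cols):
            cell = page[r * cell_h:(r + 1) * cell_h,
                        c * cell_w:(c + 1) * cell_w]
            # Trim a few pixels on every side to drop the printed boundary.
            cell = cell[margin:cell_h - margin, margin:cell_w - margin]
            cells.append(resize_nearest(cell, out_size))
    return np.stack(cells).astype(np.float32)

# Stand-in for a scanned grid: a blank page with one cell filled in.
page = np.zeros((600, 600))
page[0:100, 100:200] = 1.0           # second cell of the first row
cells = slice_grid(page, rows=6, cols=6)
print(cells.shape)                   # one (out_size, out_size) image per cell
```

A real scan would also need to be cropped and deskewed before slicing, which this sketch leaves out.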
Participants had access to a data visualization and curation dashboard to help them curate a training dataset from over 28000 shared samples (Figure 7). The dashboard was designed around a series of data visualization plots. To create these plots, the data samples were processed with the t-SNE algorithm to reduce their dimensionality to only two. Mapping complex datasets from a high-dimensional space to a very low-dimensional one is a common practice in data visualization. When used effectively, it helps users comprehend the distribution of the data, hidden patterns, and relationships between samples that would otherwise be almost impossible to see. The resulting two-dimensional mapping was a distribution of samples based on their visual features. A color scheme representing the label of each sample (i.e., a, b, X, Q, …) was also applied to each plot. Users could hover over samples in either of the two scatter plots to inspect them one by one, or review them in bulk using the selection tools.
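As a rough illustration of the dimensionality reduction step, the sketch below runs scikit-learn's `TSNE` on random stand-in data. The library choice, the 8 × 8 sample size, and the `perplexity` value are assumptions for the example; the source does not state which implementation or settings the dashboard used.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in data: 60 small "glyph" images with integer labels in 0..51.
samples = rng.random((60, 8, 8)).astype(np.float32)
labels = rng.integers(0, 52, size=60)

# t-SNE expects one flat feature vector per sample.
flat = samples.reshape(len(samples), -1)

# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
embedding = tsne.fit_transform(flat)

print(embedding.shape)  # one (x, y) point per sample
```

Each row of `embedding` can then be drawn as a scatter point and colored by the matching entry in `labels`.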
The Conditional Variational AutoEncoder used in this study follows the basic architecture of VAEs: an encoder, a variational sampling layer, and a decoder, stacked one after the other. Both the encoder and the decoder use a cascade of modular blocks. In the encoder, each block consists of a 2D convolutional layer, followed by max pooling, batch normalization, and a leaky ReLU activation function. In the decoder, each block uses a transposed convolution to upsample the input, followed by batch normalization and leaky ReLU. The only exceptions are the first block of the encoder, which skips the max pooling, and the last block of the decoder, which substitutes a sigmoid for the leaky ReLU to keep the results in the 0.0 to 1.0 range.
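The variational sampling step that sits between the encoder and the decoder can be sketched in plain NumPy as below. This is only an illustration under stated assumptions: the latent size of 16 is hypothetical, and conditioning by concatenating a one-hot label to the latent vector is one common approach, not necessarily the one used in the study.

```python
import numpy as np

NUM_CLASSES = 52   # upper- and lower-case English letters
LATENT_DIM = 16    # hypothetical latent size, chosen for illustration

def one_hot(labels, num_classes=NUM_CLASSES):
    """Encode integer labels as one-hot rows."""
    out = np.zeros((len(labels), num_classes), dtype=np.float32)
    out[np.arange(len(labels)), labels] = 1.0
    return out

def sample_latent(mean, log_var, rng):
    """Reparameterization: z = mean + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mean.shape).astype(np.float32)
    return mean + np.exp(0.5 * log_var) * eps

def condition(z, labels):
    """Append the one-hot label so a decoder knows which glyph to draw."""
    return np.concatenate([z, one_hot(labels)], axis=1)

rng = np.random.default_rng(0)
# Stand-ins for the encoder outputs of a batch of four samples.
mean = np.zeros((4, LATENT_DIM), dtype=np.float32)
log_var = np.zeros((4, LATENT_DIM), dtype=np.float32)
labels = np.array([0, 1, 26, 51])

z = sample_latent(mean, log_var, rng)
decoder_input = condition(z, labels)
print(decoder_input.shape)
```

Sampling through `mean` and `log_var` rather than drawing `z` directly keeps the operation differentiable, which is what lets a VAE be trained end to end.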
While the architecture of the model was fixed for all the participants, they had the option to adjust a few training parameters, most notably the number of training epochs.
Participants used the CoLab notebook to train the model on Google's cloud servers. The training section was designed to expose some high-level control through hyperparameters, as well as a series of visualizations to help with qualitative observations and quantitative evaluations. Most notably, participants could see the model's performance through a three-row plot. The first row of the plot shows a random set of samples from the validation dataset. The second row shows the same samples after passing through the model: encoded to the latent space and then decoded to be reconstructed. The third row shows the differences between the first and second rows; yellow regions indicate the greatest similarity, while red spots highlight the largest discrepancies. This combination of plots visualizes the model's performance in an intuitive way. Participants could make a qualitative evaluation based on the fuzziness of the results in the second row and the extent of the red regions in the third row. Sharper results in the second row and fewer red regions in the third row indicated better performance. Together, the three-row plot and the classic loss-per-epoch plot provided participants with two means of supervising the training process.
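The data behind such a three-row plot could be assembled roughly as follows. The function name `comparison_grid`, the 28 × 28 sample size, and the use of an absolute pixel difference are assumptions for this sketch; the notebook's actual plotting code is not shown in the source.

```python
import numpy as np

def comparison_grid(originals, reconstructions):
    """Stack originals, reconstructions, and their differences as three rows.

    Both inputs have shape (n, height, width) with values in [0, 1].
    Returns the tiled image and the mean absolute error of each sample.
    """
    diff = np.abs(originals - reconstructions)
    rows = [np.hstack(originals), np.hstack(reconstructions), np.hstack(diff)]
    per_sample_error = diff.reshape(len(diff), -1).mean(axis=1)
    return np.vstack(rows), per_sample_error

rng = np.random.default_rng(0)
originals = rng.random((5, 28, 28))
# Stand-in for model output: a perfect copy, except for one degraded sample.
reconstructions = originals.copy()
reconstructions[0] = 0.5

grid, errors = comparison_grid(originals, reconstructions)
print(grid.shape, errors.round(3))
```

To approximate the coloring described above, the third row could be drawn separately with a colormap that runs from yellow at low values to red at high values, for example Matplotlib's `autumn_r`.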
As the sampling method became more likely to produce legitimate results, the overall process became less cumbersome to finish. Meanwhile, to encourage participants to experiment with the sampling parameters, rather than leaving the settings on their default values and skipping through each glyph, the sampling parameters were intentionally defaulted to produce borderline results. The results submitted after these changes were noticeably improved compared with those presented previously.
Finally, participants could use the interactive rendering widgets to convert an input text to a handwritten form while adjusting the spacing between letters as well as between lines. The generated typeface is saved as a series of 2D images in a NumPy array whose dimensions reflect the 52 glyphs in the typeface, the number of samples for each glyph, and the number of pixels along each glyph's width and height (Figure 14). The final NumPy array is then saved to be used later.
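Rendering text from such an array can be sketched as below. The function names, the glyph order (a to z followed by A to Z), and the default spacing values are assumptions for the example; the array layout of glyphs, samples per glyph, height, and width follows the description above.

```python
import string
import numpy as np

# Assumed glyph order along the first axis of the typeface array.
ALPHABET = string.ascii_lowercase + string.ascii_uppercase

def render_line(text, typeface, letter_gap=2, rng=None):
    """Render one line by picking a random sample for every letter."""
    if rng is None:
        rng = np.random.default_rng()
    _, n_samples, size, _ = typeface.shape
    pieces = []
    for ch in text:
        if ch in ALPHABET:
            glyph = typeface[ALPHABET.index(ch), rng.integers(n_samples)]
        else:
            # Spaces and unsupported characters become blank cells.
            glyph = np.zeros((size, size), dtype=typeface.dtype)
        pieces.append(glyph)
        pieces.append(np.zeros((size, letter_gap), dtype=typeface.dtype))
    if not pieces:
        return np.zeros((size, 0), dtype=typeface.dtype)
    return np.hstack(pieces)

def render_text(text, typeface, letter_gap=2, line_gap=4, rng=None):
    """Render multi-line text, padding every line to the same width."""
    lines = [render_line(t, typeface, letter_gap, rng)
             for t in text.split("\n")]
    width = max(line.shape[1] for line in lines)
    padded = [np.pad(line, ((0, line_gap), (0, width - line.shape[1])))
              for line in lines]
    return np.vstack(padded)

# Stand-in typeface: 52 glyphs, 3 samples each, 8 x 8 pixels, all "ink".
typeface = np.ones((52, 3, 8, 8), dtype=np.float32)
image = render_text("Hi\nyou", typeface, letter_gap=2, line_gap=4)
print(image.shape)
```

Because a different sample is drawn for every occurrence of a letter, repeated letters do not look identical, which is part of what makes the output read as handwriting rather than a fixed font.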