The Poetics of Chicanx Student Activists
By Brandon Daniels (Communication) and Joshua Ladd (Computer Science)
Our goal in undertaking this project was to study the creative writing of Chicano students at varying degrees of scale with the use of computational methods. This project began from an interest in the methods of text mining. Both authors were interested in the significance of the term “Chicano,” and how it was articulated in various ways throughout the El Diario de la Gente corpus. Our growing familiarity with our data set led us on a path of historical research into the Chicanx social movement in the United States. We wanted to understand how certain discourses of Chicanx identity had impacted not only the student authors at CU Boulder, but also the entire Chicanx student movement across the Southwestern United States.
We found that there were a number of events and publications that were influential for the Chicanx student movement. In 1969, Rodolfo ‘Corky’ Gonzáles’s organization Crusade for Justice hosted the first National Chicano Youth Liberation Conference. At this week-long conference, hundreds of Chicanx students voted to endorse the Plan Espiritual de Aztlán, a manifesto that worked to construct a positive self-identity for Chicanx people and theorized a radical program for social change. Furthermore, Corky Gonzáles’s epic poem “Yo Soy Joaquin” was well circulated by this time. According to historical accounts of the Chicanx social movement, this conference and these two publications had an extraordinary influence on the development of Chicanx identity and the movement. As literary critic Mary Louise Pratt (1993) argued, “Like the Plan de Aztlán, “Yo soy Joaquin” had the effect of anchoring Chicanx identity in a series of cultural coordinates that included land, agriculture, spirituality, and links to the indigenous” (p. 868).
We aimed to use computational text mining methods to test the validity of this influence thesis and, in the process, to generate quantitative evidence for shared connections (linguistic, stylistic, or thematic) between the “Yo Soy Joaquin” poem and the creative writing dataset from Chicanx students. Our work furthers a conversation within the digital humanities about computational poetics, but it also gave us the ability to study at scale the poetry and short stories of Chicanx students. We found that while there is evidence of similarity between Corky’s writing and El Diario de la Gente creative work, natural language processing is severely limited in its ability to interpret multilingual code switching, which introduced flaws into our method. We end this essay with a call for greater research and development of rich multilingual corpora and natural language processing models.
Our work builds upon the research of scholars within the digital humanities working in the related subfields of digital literary studies and computational poetics. In the development of our method for studying El Diario de la Gente, we were inspired by David Kaplan and David Blei’s (2007) computational study of style in American poetry. Since the influence of Corky’s poem should not be limited to the content material discussed, we wanted to understand if the stylistic elements of his writing could be discovered in the corpus. Kaplan and Blei utilized text mining to record the frequencies of parts of speech (POS) in different poems. They argued that this data point serves to reflect the syntactic structure used by poets. Furthermore, Justine Kao and Dan Jurafsky (2012) developed computational methods for the study of “style, affect, and imagery” in poetry. These authors inspired us to adopt a measure of vocabulary richness (type-token ratio) and word frequency to measure the difference between creative works and articles in the El Diario de la Gente corpus.
Our approach to the quantitative study of poetics in Chicano student writings is inspired by the movements within the Digital Humanities that attend to the study of racial difference. As explored in our initial design of this research project, the purpose of tracing the different articulations of Chicano within the corpus was to explore the utility of computational and quantitative methods for studying the crafting of self-identity in Chicanx student writing. As Richard Jean So, Hoyt Long, and Yuancheng Zhu (2019) argue,
Our overall aim in this article, then, is to implement a computational study of race that is critical, reflexive, and interpretative, one that acknowledges the necessary limits of quantitative method (in particular its categorical logic) while exploring its affordances for thinking about racial difference at scale (its patterns and regularities) (n.p.).
So, while we were enamored by the prospect of studying racial difference at scale, we were hesitant to draw conclusions that may reify or essentialize oppressive identity categories. One consideration we made was about the danger of lacking an intersectional approach to the study of the Chicanx movement, for women played a large role in this movement. Unfortunately, poems like “Yo Soy Joaquin” constructed the Chicanx identity through the normative male subject. This required us to pay close attention to gender in our analysis and to search the texts for all variants of the gendered proper noun: Chicana and Chicanas.
Our data set consists of the documents contained in the El Diario corpus as well as the poem “Yo Soy Joaquin” and a selection of speeches by Gonzalez. In the El Diario dataset, there are four main classes of documents. There are 950 ‘articles’, 104 ‘creatives’, 367 ‘notices’, and 256 ‘adverts’. Some articles needed to be discarded due to formatting and encoding errors that prevented them from being useful in a large-scale computation. Our smaller corpus of Corky Gonzales writings was not divided by category, and we had passages from 7 speeches and the poem Yo Soy Joaquin divided into 12 documents. We curated the small dataset of speeches in order to compensate for the lack of documents that we had available for comparison between the writing styles of the students within the Chicanx movement and Gonzalez. In our analyses, we largely focused on the documents from our corpus in the “creative” category, as one of our hypotheses was that the influence of Gonzalez’s writings and speeches would be more evident within that subset of student works.
We used a suite of methods from the natural language processing community to attempt to tease out a relationship between El Diario’s creative works and Yo Soy Joaquin. Our methods are all intended to detect whether there are meaningful differences or similarities in the way that language is used between two document types (articles, creatives, works by Gonzalez, etc.). We offer a brief description of each method below.
POS (Part of Speech) Tag Frequency uses computer models to assign a grammatical tag to each word in a document, and then compares the relative frequency of each tag between classes of documents. The method assumes that there is a functional difference in the POS distributions of each document type (creative, article, works written by Gonzalez). It is sometimes used in computational poetry analysis because language is used differently in poetic works than in more prosaic writing.
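The comparison step of this method can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the project: the function names `tag_frequencies` and `frequency_gap` are hypothetical, and the tagged samples are invented and hand-tagged, standing in for the output of an automatic tagger.

```python
from collections import Counter


def tag_frequencies(tagged_tokens):
    """Return the share of tokens that carry each POS tag."""
    counts = Counter(tag for _word, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


def frequency_gap(freq_a, freq_b):
    """Per-tag difference in relative frequency (a minus b)."""
    tags = sorted(set(freq_a) | set(freq_b))
    return {tag: freq_a.get(tag, 0.0) - freq_b.get(tag, 0.0) for tag in tags}


# Invented, hand-tagged samples (assumption); a real pipeline would
# obtain (word, tag) pairs from a trained tagger.
creative_sample = [("land", "NOUN"), ("sings", "VERB"),
                   ("red", "ADJ"), ("earth", "NOUN")]
article_sample = [("council", "NOUN"), ("voted", "VERB")]

creative_freq = tag_frequencies(creative_sample)
article_freq = tag_frequencies(article_sample)
gap = frequency_gap(creative_freq, article_freq)

for tag, diff in gap.items():
    print(f"{tag}: {diff:+.2f}")
```

A positive value means the tag is relatively more common in the first class of documents. Working with shares rather than raw counts keeps long and short documents comparable.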
Vocabulary Richness is a simple metric that measures the ratio between ‘types’ and ‘tokens’ in the dataset. Essentially, it measures the number of unique words used divided by the total number of words used. Stopwords are removed from this analysis, as they inflate the number of words used while not contributing heavily to the number of unique words.
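As a rough sketch of the type-token ratio described here, the following Python function counts unique words against total words after dropping stopwords. The function name `type_token_ratio`, the tiny stopword set, and the sample sentence are all assumptions made for illustration; the project's own tokenizer and stopword list may differ.

```python
import re

# Tiny illustrative stopword set (assumption); a real analysis would use
# fuller English and Spanish lists.
STOPWORDS = {"the", "a", "of", "and", "el", "la", "de", "y"}


def type_token_ratio(text, stopwords=STOPWORDS):
    """Unique words divided by total words, after removing stopwords."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    tokens = [w for w in words if w not in stopwords]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Invented sample: after removing "the", the tokens are
# land, land, people, so there are 2 types among 3 tokens.
sample_ratio = type_token_ratio("the land the land the people")
print(round(sample_ratio, 3))
```

Because the ratio tends to fall as texts get longer, comparisons are most meaningful between documents of similar length or between per-class averages.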
Term-Frequency Inverse-Document-Frequency (Tf-idf) is a method for converting individual text documents into numerical forms. Rather than just counting words like the two bag-of-words methods above, it also considers how frequent those words are across the entire dataset. This is meant to prevent words that are common in all documents from seeming like they are uniquely important to a particular document of interest. This method is often called a ‘vector-space’ model of text, meaning that it converts each document into a vector, or list of numerical attributes. This is useful for machine learning and visualization techniques.
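To make the weighting concrete, here is a minimal sketch of one textbook Tf-idf variant: term frequency is a word's share of its document, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the word. The function name `tfidf_vectors` and the three sample documents are invented for illustration. Library implementations often add smoothing and normalization, so their numbers will not match this sketch exactly.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) using tf = count / doc length
    and idf = ln(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for tokens in tokenized for tok in tokens})
    n_docs = len(tokenized)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(tok for tokens in tokenized for tok in set(tokens))
    idf = {tok: math.log(n_docs / doc_freq[tok]) for tok in vocabulary}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = len(tokens) or 1  # avoid dividing by zero on empty docs
        vectors.append([(counts[tok] / length) * idf[tok]
                        for tok in vocabulary])
    return vocabulary, vectors


# Invented sample documents (assumption).
documents = ["la raza unida", "la gente unida", "la tierra"]
vocabulary, vectors = tfidf_vectors(documents)

for doc, vec in zip(documents, vectors):
    weights = {tok: round(w, 3) for tok, w in zip(vocabulary, vec) if w > 0}
    print(doc, "->", weights)
```

In this sample the word "la" appears in every document, so its weight is zero everywhere, while "tierra" appears in only one document and receives the largest weight there. That is the down-weighting of common words described above.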
Principal Component Analysis (PCA) was used in conjunction with Tf-idf to generate figures showing the linguistic relationships learned between documents. It is a method for reducing the dimensionality of data. Because the vector produced by Tf-idf is very long, we need to compress it down to two dimensions so that we can plot it and see what trends are present.
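The reduction to two dimensions can be illustrated with a small, dependency-free sketch. It centers the document vectors and then finds the two directions of greatest variance by power iteration with deflation. This is a stand-in chosen for readability, not the routine used in the project, which would more likely rely on a numerical library. The names `pca_2d` and `sample_vectors` are hypothetical.

```python
import math
import random


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _center(rows):
    """Subtract the column mean from every value."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def _top_direction(rows, iterations=200, seed=0):
    """Power iteration for the leading eigenvector of X^T X."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in rows[0]]
    start_norm = math.sqrt(_dot(v, v))
    v = [x / start_norm for x in v]
    for _ in range(iterations):
        scores = [_dot(row, v) for row in rows]               # X v
        w = [sum(s * row[j] for s, row in zip(scores, rows))  # X^T (X v)
             for j in range(len(v))]
        norm = math.sqrt(_dot(w, w))
        if norm < 1e-12:  # no variance left to explain
            break
        v = [x / norm for x in w]
    return v


def pca_2d(vectors):
    """Project each vector onto its first two principal components."""
    rows = _center(vectors)
    coords = []
    for _ in range(2):
        direction = _top_direction(rows)
        scores = [_dot(row, direction) for row in rows]
        coords.append(scores)
        # Deflation: remove the variance already captured.
        rows = [[x - s * d for x, d in zip(row, direction)]
                for row, s in zip(rows, scores)]
    return list(zip(*coords))


# Invented sample: four 3-dimensional points lying on one line, so all
# of their variance should fall on the first component.
sample_vectors = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0],
                  [3.0, 6.0, 0.0], [4.0, 8.0, 0.0]]
points = pca_2d(sample_vectors)
for x, y in points:
    print(f"{x:+.3f} {y:+.3f}")
```

Each document becomes an (x, y) pair that can be plotted directly. The sign of each axis is arbitrary in PCA, so only relative positions and distances between points carry meaning.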
A known weak point and open question in the literature of natural language processing is the computational understanding of figurative and rhetorical speech. For our inquiry, this became a mildly confounding variable, as our central question revolved around measuring the degree to which a figurative and rhetorical text exerted influence on the writing style and thematic content of a student newspaper.
Our first approach, POS tag frequency, yielded by far the least interesting results. As you can see in Figure 1, there are almost no differences between Corky Gonzalez’s uses of grammar and those of El Diario. Even considering the creatives and articles within El Diario as separate, there are no meaningful differences or similarities to glean. We propose two potential causes of this. The first cause may be that computational POS tagging schemes are not well suited to annotating mixed-language poetic works. It is possible that some of the successful analyses that use POS frequency are more hand-tailored, using manual POS tagging before computationally comparing the frequencies. The second potential source of error is that we added transcribed speeches of Gonzalez’s to the poem in order to gain more examples of his style. It is possible that the difference in use of language between spoken and written works led to a muddling of the statistical similarities between Gonzalez’s works and El Diario’s.
Figure 1: POS Tag Frequency
Vocabulary richness was far more interesting as a metric than POS frequency. The average richness for documents of each type within our corpus is shown in Figure 2. One of the immediately noticeable elements of this figure is that creative works and Gonzalez’s writings are by far the most vocabulary-rich types of documents that we analyzed. An additional point of interest is that articles show the least richness, hinting that there is a specific tonal character of these works that tends to lead them towards using similar words. This is likely due to the nature of creatives and articles, as creatives in our dataset are largely poems and multilingual works, which tend to have more varied uses of words than articles, which are likely focused on news about current events. This result is consistent with our hypothesis that the creatives draw the most influence from Gonzalez, but it could also just indicate that both the creative works and Gonzalez use a large vocabulary, and there may not be any real similarity between those documents.
Figure 2: Vocabulary Richness by Document Type
Our last arm of analysis was to use Tf-idf and PCA to see spatially the semantic relationships between various classes of documents. Using Tf-idf, we created vectors for each document, and then we reduced their dimensionality using PCA to display each document as a point on a graph. As is shown in Figure 3, this allows us to see relationships between categories of text. We have included only articles, creatives, and works by Gonzalez to see which types of documents fall closest to Corky Gonzalez’s writings. There are a few visual features that immediately pop out. First is that Corky Gonzalez’s works are all clustered in a relatively small space on the figure, whereas the articles are spread throughout the image. This potentially indicates that a wider spread of topics was discussed in articles, and that Gonzalez’s works are more focused in theme. The second visual feature is that the creatives are all piled in the same cluster as Gonzalez’s speeches and Yo Soy Joaquin, indicating that there are more similarities between those two categories than either of them shares with articles. This is promising for our analysis, but it does not dive into the nature of these relationships and could be merely representative of the genre similarities between them rather than content similarities.
Figure 3: Word Similarities of El Diario Works and Gonzalez Snippets
As mentioned before, natural language processing is severely limited in its ability to model and interpret multilingual corpora. Python’s Natural Language Toolkit library is widely used for text mining projects, but its models are trained upon English-language news articles. For example, one of our metrics, POS frequency, relies on a tagger trained on decades-old, hand-coded linguistic analyses of articles written in the Wall Street Journal.
By engaging in the study of a multilingual corpus, we were forced to uncover the linguistic biases that plague natural language processing. In doing so, we were reminded of Tara McPherson’s advice: “We must remember that computers are themselves encoders of culture” (n.p.). We call for DH scholars interested in social justice to expand the available corpora of multilingual texts, so that our efforts in natural language processing can be used for studying the ordinary and commonplace uses of language – not simply the discourses of Anglosphere journalists.
Kao, J., & Jurafsky, D. (2012, June). A computational analysis of style, affect, and imagery in contemporary poetry. In Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature (pp. 8-17).
Kaplan, D. M., & Blei, D. M. (2007, October). A computational approach to style in American poetry. In Seventh IEEE International Conference on Data Mining (ICDM 2007) (pp. 553-558). IEEE.
McPherson, T. (2012). Why are the digital humanities so white? Or thinking the histories of race and computation. In Debates in the Digital Humanities (pp. 139-160).
Pratt, M. L. (1993). “Yo soy la Malinche”: Chicana writers and the poetics of ethnonationalism. Callaloo, 16(4), 859-873.
So, R. J., Long, H., & Zhu, Y. (2019). Race, Writing, and Computation: Racial Difference and the US Novel, 1880-2000. Journal of Cultural Analytics.
Our code: https://github.com/joshladd/ElDiario.git