For this project, machine learning, natural language processing, and generative modeling techniques were used to explore human cognition through storytelling. The analysis explored whether the similarity score between a retold or imagined story and its corresponding recalled story could be predicted, and whether it is possible to predict if a story is imagined or recalled. The dataset used contains real and imagined stories, a summary of the original story, and a story recall collected after some time had passed. The team determined that the dataset is not representative of the population, with a bias towards white, younger respondents. The analysis was limited, but the highest accuracy score was achieved using logistic regression.
Introduction & Background
Storytelling is an essential part of how people connect. A better understanding of how cognitive recall works during storytelling can help in situations where emotions run high and memory is unreliable. It is important to consider the conditions under which someone might be asked to “tell me what happened,” as well as the implications of their response.
Various departments within an organization could use this analysis when handling discrimination and sexual harassment cases, since imagined stories may carry measurable markers of deception. Understanding how memory affects our storytelling abilities is an important step in recognizing trauma and how the mind processes information over time.
Furthermore, this kind of analysis relies on parsing techniques that also underpin natural language interaction with modern technologies such as Siri (Apple) and Alexa (Amazon). Similar formulations can help identify patterns in speech tied to abstract aspects of human communication (e.g., sarcasm or bias).
To investigate this relationship between human memory and storytelling, the team utilized data provided by Microsoft Open Source in association with a published research paper. Most of the data is unstructured. The team reviewed the data to link story-identification codes together and applied text analysis tools such as the Natural Language Toolkit (NLTK) library and scikit-learn algorithms as part of the preprocessing steps. This is where the team discovered some bias within the dataset. The team then extracted insights by exploring summary statistics, building decision trees, and running a logistic regression. Finally, the team considered how this analysis could benefit society and our understanding of the intersection between memory, storytelling, and business analytics, providing insight into topics like cognition, business assessments, and general ethical dilemmas.
In comparison to the research paper published with this dataset, the team looked at the data from a slightly different perspective. While the original researchers designed parameters for ‘narrativization of recall’, ‘concrete event frequency’, and ‘common sense creation’, this analysis used those parameters alongside demographic and annotator self-reported inputs to distinguish key signals in the data. For example, the concreteness attribute can be used to predict whether a story’s origin is genuine.
The dataset is accessible through Microsoft’s Research Open Data site. It includes 6,854 stories in English. Variables include gender, race, age, and how the participant felt during the story creation and recall process.
Amazon Mechanical Turk workers filled out a survey that collected the data of interest. They were then asked to recall the story within months of writing it. All data was collected in written format, so the exact text of the summary, the original story, and the recalled story is provided in the dataset.
Data Pre-Processing & Exploration
Because the data is largely unstructured, it required substantial pre-processing before it could be analyzed. The first step was to match each retold and imagined story back to the original recalled story written by the worker. The dataset included a pair identification code, which allowed observations to be matched.
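A minimal sketch of the pair-matching step, using pandas and hypothetical column names (the actual identifiers in the dataset may differ):

```python
import pandas as pd

# Hypothetical column names; the real dataset's identifiers may differ.
recalled = pd.DataFrame({
    "pair_id": ["a1", "a2", "a3"],
    "recalled_story": ["story A", "story B", "story C"],
})
retold = pd.DataFrame({
    "pair_id": ["a1", "a2", "a3"],
    "retold_story": ["retold A", "retold B", "retold C"],
})

# Inner-join on the shared pair identification code so each
# retold/imagined story lines up with its original recalled story.
pairs = recalled.merge(retold, on="pair_id", how="inner")
```

An inner join keeps only observations whose identification code appears in both tables, which is the behavior wanted when some stories lack a matching pair.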
The story paragraphs were processed using the NLTK library. NLTK can separate paragraphs into sentences and sentences into words in a process called tokenization. NLTK was also used to remove stop words because, although they appear often in text, they do not contribute to the overall meaning of a sentence (e.g., the, a, of). Verbs, nouns, and adjectives were also lemmatized. Words like “play” and “playing” are similar, but the analysis would treat them as different unless the data was lemmatized. Using the root form of a word therefore improves the similarity of paragraphs that may have been told in different tenses. These pre-processing steps enable the creation of derived features, including similarity scores between the original story and the retold/imagined story and overall sentiment scores for each story.
The dataset also included self-reported demographic information such as race and gender. The binary genders “male” and “female” dominated every 5-year age range; non-binary genders appeared mainly in younger age groups but lacked representation overall.
Individuals who identified as white are heavily represented in every age range. The age distribution is also skewed toward younger respondents: most participants fall in the 25–35 range. Because memory retention and accuracy are assumed to decline with age, results from this younger sample may overestimate recall accuracy.
Learning & Modeling
The first question explored was whether the similarity score of a retold/imagined story could be predicted from its corresponding recalled story. Using similarity score as a proxy for memory accuracy, this process revealed which features are most predictive of similarity score, and thus which factors in the dataset affect memory the most.
The models considered include multiple linear regression, a lasso-regularized linear model, and a decision tree regressor. The linear regression served as a starting point since it is quick to implement and could help guide the analysis. The lasso and decision tree regressor were chosen for their built-in feature selection, which further developed insights on what impacts a similarity score the most. Although the mean-squared errors were very small, the linearity of the model was examined further, and the data turned out not to be very linear, meaning predictions would not be very accurate.
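A sketch of the three-model comparison on a synthetic stand-in for the engineered feature matrix (the real features, such as time since the event, come from the dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the dataset's engineered features;
# the target mimics a similarity score driven by a few inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([0.5, -0.2, 0.0, 0.0, 0.1]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.01),          # zeroes out weak coefficients
    "tree": DecisionTreeRegressor(max_depth=4, random_state=0),
}
mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mse[name] = mean_squared_error(y_test, model.predict(X_test))
```

The lasso coefficients and the tree's `feature_importances_` give the built-in feature selection mentioned above: features with zero coefficient or zero importance can be dropped.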
The second question was whether it is possible to replicate what the research paper accompanying the dataset set out to do: predict whether a story is imagined or recalled. Most of the provided features weren’t helpful, so new features were engineered for this analysis. A total of seven features were used: the sentiment of the story with respect to the summary, the sentiment of the story with respect to the main event, the total word count of stories, the total word count of “cleaned” stories, the cosine similarity of the story with respect to the summary, the cosine similarity of the story with respect to the main event, and the average concreteness of the words.
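Two of the seven features, the summary cosine similarity and the raw word count, can be sketched with scikit-learn (the example texts are illustrative, and TF-IDF is one reasonable choice of vectorization, not necessarily the one the team used):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def story_features(story, summary):
    """Cosine similarity of the story to its summary, plus word count."""
    tfidf = TfidfVectorizer().fit_transform([story, summary])
    sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
    return {"summary_similarity": sim, "word_count": len(story.split())}

feats = story_features(
    "We drove to the lake and camped for two nights.",
    "A camping trip to the lake.",
)
```

The same function applied with the main-event text in place of the summary yields the corresponding main-event similarity feature.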
Sentiment analysis was based on the Summary and Main Event with respect to the story, because the recalled story was expected to have a larger absolute sentiment value than the imagined one. Concreteness evaluates the degree to which the concept denoted by a word refers to a recognizable entity; it was introduced by Dr. Brysbaert and his team in 2014. Their database covers roughly 37,000 English words and 3,000 two-word expressions, each with a mean and standard deviation concreteness score. For reference, 280 words or expressions share the highest mean concreteness score of 5 (e.g., sled and peacock), while the lowest-scoring entries are “eh” and “essentialness” with a mean score of 1.04.
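The average-concreteness feature reduces to a lookup-and-average over the norms. A minimal sketch, using a toy excerpt of the norms with the scores quoted above (the full Brysbaert list would be loaded from its published file):

```python
# Toy excerpt of the concreteness norms; the full Brysbaert et al. (2014)
# list covers ~37,000 words and ~3,000 two-word expressions.
CONCRETENESS = {"sled": 5.0, "peacock": 5.0, "eh": 1.04, "essentialness": 1.04}

def mean_concreteness(tokens, norms=CONCRETENESS):
    """Average the concreteness ratings of the tokens found in the norms;
    tokens missing from the norms are simply skipped."""
    scores = [norms[t] for t in tokens if t in norms]
    return sum(scores) / len(scores) if scores else None
```

Skipping out-of-vocabulary tokens rather than scoring them as zero avoids dragging down stories that happen to use rare words.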
When looking at the linear regressions and the decision tree regressor, certain variables appeared throughout these models. They include the log of the time (in days) since the event was told to the person who would go on to either retell the story or imagine a new one. Draining, an indicator variable capturing how tired a person was while performing the task, also appeared. These results made intuitive sense, but they were not the most insightful.
To predict the type of story, several binary classifiers were compared on accuracy. Logistic regression achieved the highest accuracy of all the binomial classifiers, at 62%. Among all the features, the decision tree classifier indicated that the sentiment of the summary and the total word count are the two most important features. The decision boundary graph below shows that total word count plays a large part in the classification.
The decision tree is another simple way of showing feature importance. Comparing many features at once is difficult; however, the tree suggests that when both the similarity and the total word count are large, the story is likely imagined.
The logistic regression provided the best results, outperforming even the multi-layer perceptron classifier, which was surprising given that a neural network was expected to perform better.
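The classifier comparison can be sketched as follows on synthetic stand-in data (seven features to mirror the engineered set; the real labels and features come from the dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the seven engineered features; label 1 = imagined.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 7))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=400) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Logistic regression: the best-performing binary classifier in the report.
clf = LogisticRegression().fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))

# A decision tree's feature_importances_ show which inputs drive the splits,
# mirroring how the report identified its two most important features.
tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_train, y_train)
importances = tree.feature_importances_
```

On the real data the same pattern held: the accuracy score came from the logistic regression, while the tree's importances singled out summary sentiment and total word count.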
The team explored human memory through storytelling. The dataset provides real and imagined stories, a summary of the original story, and a story recall that was observed after some time. Using text analysis methods and predictive modeling methods, the team determined that the logistic regression had the highest accuracy results.
Because the group did not have a strong linguistics background, team members focused too heavily on demographic information like race and gender. The dataset was not an accurate representation of the population: there was a disproportionate number of white respondents, and most people categorized themselves as either male or female. Because memory is affected by age, a dataset with few older participants is not ideal, so the dataset is biased in this regard. Furthermore, the team found that similarity score was not the best approach, that another feature would have served better, and that neural networks are worth considering moving forward. The team also should not have attempted to build decision trees directly on unstructured data.
There were a few weaknesses in the dataset that limited the analysis. For example, it would be helpful to have more features included. Measuring how long someone took to brainstorm a story idea and how long it took to write the story could reveal differences between real and imagined stories: does it take longer to write a fictional story? It would also be helpful to have multiple people read each story and try to recall it later. Having several people recall one story would help measure memory accuracy, since one person writing a story and recalling it later differs from other people recalling a story they only read.
To deepen the analysis, future exploration should focus on the content of the stories written and recalled. Do imagined stories use simpler words? Are the sentences within imagined stories shorter or less detailed? Do real stories use more adjectives? Furthermore, it would be interesting to explore whether the models can predict who wrote a specific story based on their writing style.