Machine learning could finally crack the 4,000-year-old Indus script

In 1872 a British general named Alexander Cunningham, excavating an area in what was then British-controlled northern India, came across something peculiar. Buried in some ruins, he uncovered a small, one inch by one inch square piece of what he described as smooth, black, unpolished stone engraved with strange symbols — lines, interlocking ovals, something resembling a fish — and what looked like a bull etched underneath. The general, not recognizing the symbols and finding the bull to be unlike other Indian animals, assumed the artifact wasn’t Indian at all but some misplaced foreign token. The stone, along with similar ones found over the next few years, ended up in the British Museum. In the 1920s many more of these artifacts, by then known as seals, were found and identified as evidence of a 4,000-year-old culture now known as the Indus Valley Civilization, the oldest known Indian civilization to date.

Since then, thousands more of these tiny seals have been uncovered. Most of them feature one line of symbols at the top with a picture, usually of an animal, carved below. The animals pictured include bulls, rhinoceroses, elephants, and, puzzlingly, unicorns. The seals have been found in a swath of territory that covers present-day India and Pakistan and along trade routes, some as far away as present-day Iraq. And the symbols, which range from geometric designs to representations of fish or jars, have also been found on signs, tablets, copper plates, tools, and pottery.

Though we now have thousands of examples of these symbols, we have very little idea what they mean. Over a century after Cunningham’s discovery, the seals remain undeciphered, their messages lost to us. Are they the letters of an ancient language? Or are they just religious, familial, or political symbols? Those hotly contested questions have sparked infighting among scholars and exacerbated cultural rivalries over who can claim the script as their heritage. But new work from researchers using sophisticated algorithms, machine learning, and even cognitive science is finally helping push us to the edge of cracking the Indus script.

Steatite seal with humped bull, Indus Valley, Mohenjo-Daro, 2500–2000 BC. Photo by CM Dixon / Print Collector / Getty Images

Spanning from 2600 to 1900 BC, the Indus Valley Civilization was larger than the Egyptian and Mesopotamian civilizations, encompassing over 1 million square kilometers that stretched over present-day India and Pakistan. It featured sophisticated infrastructure including advanced water management and drainage systems, well-organized cities with street planning, and some of the first known toilets. The Indus people also hosted a massive trade network, traveling as far as the Persian Gulf. In fact, the first traces of the Indus people were rediscovered in the mid-19th century, when construction workers tasked with connecting two cities in modern-day Pakistan came across a massive supply of bricks among some old ruins. The workers used them to construct nearly 100 miles of railroad tracks. It would be some time before archaeologists realized those bricks came from the Indus Valley Civilization.

Archaeological digs have revealed precious little. Oddly, and unlike other Bronze Age civilizations, the Indus Valley Civilization left no evidence of powerful rulers or religious icons. We haven’t found any palaces or large statues, nothing like the ziggurats of Mesopotamia or the pyramids in Egypt. And we have very little indication of warfare, save for some excavated spearheads and arrowheads.

In fact, we know almost nothing. “If you were to ask an archaeologist, they would not be able to tell you where the Indus Civilization came from with certainty, or how it ended, or what they were doing when they were around,” says epigrapher Bryan Wells. To us, the Indus Civilization is as mysterious as its symbols.

This seal comes from the Indus Valley Civilization and is currently housed in the National Museum of New Delhi. Photo by Angelo Hornak / Corbis via Getty Images

The Indus symbols are part of a slowly shrinking list of undeciphered ancient scripts. Scholars are still working on a number of writing systems found all over the world including Linear A and Cretan hieroglyphs (two scripts from ancient Greece), Proto-Elamite (writing from the oldest known Iranian civilization), a handful of Mesoamerican scripts, and the Rongorongo script of Easter Island. Some Neolithic symbols, with no known linguistic descendants, may never be deciphered. Other ancient scripts, such as Linear B, which recorded an early form of Greek, were eventually deciphered by charting out the signs, figuring out which marked the start of a phrase and which marked the end, how different syllables changed the meaning of a word, and how consonants and vowels were structured within a sentence. It’s not unlike what’s depicted in the alien sci-fi film Arrival: searching for patterns, testing out theories, and lots and lots of trial and error. There’s slightly less pressure on Indus scholars than on Arrival’s linguist, though, since people aren’t quite as worried about ancient civilizations as they are about invading aliens.


In the past, much of this work was done by hand. For Linear B, painstakingly compiled phonetic charts eventually led to the script’s decipherment. Similar approaches have been tried with the Indus script as well. In the 1930s, the scholar G.R. Hunter worked out sign clusters that enabled him to figure out some of the structure embedded in the script. But Hunter failed to unlock the code.

“There are several reasons why it’s been too difficult to decipher this script,” says Nisha Yadav, a researcher in the Department of Astronomy and Astrophysics at the Tata Institute of Fundamental Research in Mumbai, India. “The first one is that the texts are really short.” An average artifact only has five symbols. The longest example excavated so far has 17. Such short texts make uncovering the writing’s structure difficult. “Complicating the problem is the fact that we don’t know the underlying language,” says Rajesh Rao, director of the National Science Foundation’s Center for Sensorimotor Neural Engineering and a professor in the Computer Science and Engineering Department at the University of Washington. “We don’t even know the language family that was spoken by people in that region at that time.” And once the civilization ended, it appears that its culture and writing system did, too. “We do not have any continuing cultural tradition,” says Yadav. Archaeologists have yet to find a multilingual text like the Rosetta Stone, which was key to deciphering Egyptian hieroglyphs.

While our understanding of the Indus script remains minimal, it’s certainly not for lack of trying. “It’s often called the most deciphered script because there are around 100 decipherments,” says Wells, “but of course nobody likes any of them.” Many people have claimed to have cracked the script, often asserting it’s a precursor to a later language, but none of the decodings have held up. “I suppose the wackiest one is a tantric guru who meditated and got in touch with the great beyond, which told him what the script said,” says Wells.

Steatite seal with elephant, Indus Valley, Mohenjo-Daro, 2500–2000 BC. Photo by CM Dixon / Print Collector / Getty Images

In order to decipher the Indus script, it’s important to ascertain what we’re looking at: whether the symbols stand for a language or, like totem poles or coats of arms, are just representations of things like family names or gods. “Given the amount of data we have, we cannot make any firm statement regarding the content of the script,” says Yadav. “I think what we’ve done is try to piece together whatever evidence we have to see if it leads us one way or the other,” says Rao. “And I think, at least from the work we’ve done, it seems like it’s more tailed towards the language hypothesis than not.” Most scholars tend to agree.

In 2009, Rao published a study that examined the sequential structure of the Indus script, or how likely it is that particular symbols follow or precede other symbols. In most linguistic systems, words or symbols follow each other in a semi-predictable manner. Certain rules dictate sentence structure, but there is also a fair amount of flexibility. Researchers quantify this semi-predictability with a measure called “conditional entropy.” Rao and his colleagues calculated how likely it was that one symbol followed another in an intentional order. “What we were interested in was if we could deduce some statistical regularities or structure,” says Rao, “basically ruling out that these symbols were just juxtapositions of symbols and that there were actually some rules or patterns.”
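To make the idea of conditional entropy concrete, here is a minimal Python sketch. It is not the code from Rao's study: the function name, the toy symbol strings, and the simple counting estimate are all assumptions chosen for illustration. It measures, in bits, how uncertain the next symbol is once the current symbol is known.

```python
import math
from collections import Counter

def conditional_entropy(sequences):
    """Estimate H(next symbol | current symbol) in bits from adjacent pairs."""
    pair_counts = Counter()   # how often symbol b directly follows symbol a
    first_counts = Counter()  # how often symbol a is followed by anything
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    total = sum(pair_counts.values())
    h = 0.0
    for (a, b), n in pair_counts.items():
        p_pair = n / total               # P(a, b)
        p_b_given_a = n / first_counts[a]  # P(b | a)
        h -= p_pair * math.log2(p_b_given_a)
    return h

# Invented toy data: letters stand in for symbols.
rigid = ["ABCD", "ABCD", "ABCD", "ABCD"]      # every symbol has one fixed successor
flexible = ["ABCD", "ABDC", "ACBD", "ADCB"]   # successors vary between texts

print(conditional_entropy(rigid))     # completely rigid ordering: 0.0 bits
print(conditional_entropy(flexible))  # some flexibility: greater than zero
```

A completely rigid system scores zero and a completely random one scores the maximum possible for its symbol inventory; the argument described above rests on where a sign system falls between those two extremes.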

They compared the conditional entropy of the Indus script to known linguistic systems, like Vedic Sanskrit, and known nonlinguistic systems, like human DNA sequences, and found that the Indus script was much more similar to the linguistic systems. “So, it’s not proof that the symbols are encoding a language but it’s additional evidence hinting that these symbols are not just random juxtapositions of arbitrary symbols,” says Rao, “and they follow patterns that are consistent with those you would expect to find if the symbols are encoding language.”

In a subsequent paper, Rao and his colleagues took all of the known Indus symbols and looked at where they fell within the inscriptions in which they were found. Their statistical technique, known as a Markov model, was able to pinpoint specifics like which symbols were most likely to begin a text, which were most likely to end it, which symbols were likely to repeat, which symbols often pair together, and which symbols tend to precede or follow a particular symbol. The Markov model is also useful when it comes to incomplete inscriptions. Many artifacts are found damaged, with parts of the inscription missing or unreadable, and a Markov model can help fill in those gaps. “You can try to complete missing symbols based on the statistics of other sequences that are complete,” explains Rao.
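The gap-filling idea can be sketched in a few lines of Python. This is a simplified first-order Markov model under stated assumptions, not the model from the paper: the symbol names ("fish", "jar", "oval", "lines") and the four inscriptions are invented, and `train_bigram_model` and `fill_gap` are hypothetical helper names.

```python
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # markers for the beginning and end of a text

def train_bigram_model(inscriptions):
    """Count how often each symbol follows another, including start/end markers."""
    follows = defaultdict(Counter)
    for text in inscriptions:
        padded = [START, *text, END]
        for prev, nxt in zip(padded, padded[1:]):
            follows[prev][nxt] += 1
    return follows

def fill_gap(follows, before, after):
    """Guess a missing symbol x by maximizing P(x | before) * P(after | x)."""
    best, best_score = None, 0.0
    candidates = follows[before]
    for x, n in candidates.items():
        if x == END:
            continue  # the end marker is not a real symbol
        p_x = n / sum(candidates.values())
        p_after = follows[x][after] / sum(follows[x].values())
        if p_x * p_after > best_score:
            best, best_score = x, p_x * p_after
    return best

# Invented toy inscriptions; real Indus texts are not represented here.
inscriptions = [
    ["fish", "jar", "oval"],
    ["fish", "jar", "lines"],
    ["oval", "jar", "lines"],
    ["fish", "oval", "lines"],
]
model = train_bigram_model(inscriptions)

print(model[START].most_common(1))        # symbol that most often begins a text
print(fill_gap(model, "fish", "lines"))   # best guess for: fish, ?, lines
```

The same table of counts answers both kinds of question from the passage above: the entries after the start marker show which symbols tend to begin a text, and the product of two transition probabilities ranks candidates for a damaged position.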

Yadav performed a similar analysis using a different type of Markov model known as an n-gram model. An example of an n-gram model at work is the Google search bar. As you start typing a query the search bar fills in suggestions based on what you’ve typed, and as you type more words the suggestions change to fit the entered text. Yadav and her colleagues looked at both the probability of a particular symbol given the symbol preceding it (a bigram) and the probability of a particular symbol given the two symbols preceding it (a trigram). The resulting patterns suggested the script had a syntax, supporting the idea that it’s linguistic. And like Rao’s Markov model, it was also able to fill in probable symbols when inscriptions were missing portions of their text.
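The difference between a bigram and a trigram estimate is easy to show with counts. The sketch below is illustrative only: the inscriptions are invented, `ngram_probability` is a hypothetical helper, and it uses plain relative frequencies with no smoothing, which a real analysis of such short texts would likely need.

```python
def ngram_probability(inscriptions, context, symbol):
    """Estimate P(symbol | context) by counting.

    `context` is a tuple holding the one (bigram) or two (trigram)
    symbols that immediately precede `symbol`.
    """
    n = len(context)
    context_count = 0
    match_count = 0
    for text in inscriptions:
        for i in range(len(text) - n):
            if tuple(text[i:i + n]) == tuple(context):
                context_count += 1
                if text[i + n] == symbol:
                    match_count += 1
    return match_count / context_count if context_count else 0.0

# Invented toy inscriptions.
inscriptions = [
    ["fish", "jar", "oval"],
    ["fish", "jar", "lines"],
    ["oval", "jar", "lines"],
    ["lines", "jar", "oval"],
]

# Bigram: how often does "lines" follow "jar"?
print(ngram_probability(inscriptions, ("jar",), "lines"))
# Trigram: how often does "lines" follow the pair "oval", "jar"?
print(ngram_probability(inscriptions, ("oval", "jar"), "lines"))
```

In this toy data, knowing only that the previous symbol was "jar" leaves the next symbol a coin flip, while knowing the previous two symbols settles it. Longer contexts sharpen predictions but are seen less often, which matters when the average text is only five symbols long.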

These two techniques also uncovered something unexpected: artifacts found in different regions depicted distinctly different symbol sequences. So seals found in what is now Iraq have symbol sequences that tend to be different from others found in India and Pakistan. “This suggests that maybe the same symbols were being used to encode the local language there,” says Rao. “It’s like they were experimenting with the script,” says Yadav. “They were using the same script to write some other language or some other content maybe.”

Providing anthropological and archaeological context to the artifacts we do have would also help further our understanding of the script. Gabriel Recchia, a research associate at the Cambridge Centre for Digital Knowledge at the University of Cambridge, published a method that aimed to do just that. In previous cognitive science studies, he and his colleagues showed that you can estimate the distances between cities by how often they’re mentioned together in writing. This was true for US cities based on their co-occurrences in national newspapers, Middle Eastern and Chinese cities based on Arabic and Chinese texts, and even cities in The Lord of the Rings. Recchia applied that idea to the Indus script, taking symbols from artifacts whose origins were known and using them to predict where artifacts of unknown origin with similar symbols came from. Recchia explains that a version of this method that takes into account much more detailed information could be very useful. “There are significant differences between artifacts that appear in different sublocations within a site and this is what is much more frequently unknown and in many cases, could provide more useful information,” says Recchia. “Was this found in a garbage heap along with a number of other seals or was this something that was imported from elsewhere?”
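As a rough illustration of predicting where an artifact came from based on its symbols, the Python sketch below compares an unprovenanced inscription against symbol-frequency profiles built from artifacts of known origin. It is a stand-in, not Recchia's published method: cosine similarity is a swapped-in technique, and the site labels, symbol names, and function names are all invented.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two symbol-count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def predict_site(known, unknown_symbols):
    """Return the site whose overall symbol profile best matches the artifact."""
    profiles = {
        site: Counter(symbol for text in texts for symbol in text)
        for site, texts in known.items()
    }
    query = Counter(unknown_symbols)
    return max(profiles, key=lambda site: cosine(query, profiles[site]))

# Invented data: hypothetical sites and inscriptions of known origin.
known = {
    "site_A": [["fish", "jar"], ["fish", "oval"]],
    "site_B": [["lines", "jar"], ["lines", "lines", "oval"]],
}

print(predict_site(known, ["fish", "fish", "jar"]))
```

A version that used finer-grained provenance, such as the sublocation within a site, would need only a different set of keys in `known`; the hard part, as the passage notes, is that this information is often missing.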

Meanwhile, Ronojoy Adhikari, a physics professor at The Institute of Mathematical Sciences in Chennai, India, and his research associate Satish Palaniappan are working on a program that can accurately extract symbols from a photo of an Indus artifact. “If an archaeologist goes to an Indus site and finds a new seal, it takes a lot of time for those seals to actually be mapped and added to a database if it’s done manually,” says Palaniappan. “In our case the ultimate aim is just with a photograph of a particular seal to be able to extract out the text regions automatically.” He and Adhikari are working on building an app that archaeologists can bring to a site on a mobile device that will extract new inscriptions instantly.

Circa 1988: Indus art, 2500 BC. Stone (steatite) seal of the Indus Valley. Photo by DEA / G. NIMATALLAH / De Agostini / Getty Images

But not everyone agrees that the script is a language. In 2004, a paper written by cultural neurobiologist and comparative historian Steve Farmer, computational theorist Richard Sproat, and philologist Michael Witzel claimed that the Indus script was not a language. The authors even went so far as to offer a $10,000 reward to anyone who finds a lengthy Indus inscription. “To view the Indus symbols as part of an ‘undeciphered script’ isn’t a view anyone outside the highly politicized world of India believes,” Farmer said in an email. After their position on the script was published, Sproat wrote two papers that examined the conditional entropy techniques used by Rao and colleagues as well as similar techniques used by a different group examining Pictish symbols, another ancient writing system. In them, Sproat concludes that the conditional entropy measure isn’t a useful technique. “What does it tell you? It tells you that it’s not completely rigid. It tells you that it’s not completely random. We knew that already. It’s just not informative,” says Sproat. “It doesn’t tell you anything.”

“Just finding structure in a bunch of symbols certainly doesn’t mean you’ve found evidence that those symbols encode language. Even heraldic symbols or astrological signs or strings of Boy Scout medals have structure in them,” says Farmer. In response to Sproat’s papers, both Rao and colleagues and the authors of the Pictish symbols study challenged by Sproat wrote replies that addressed his concerns. Sproat, in turn, wrote a response to the response.


“You would be better off getting medical advice from your garbage man than you would getting ideas about the Indus script from listening to Steve Farmer,” says Wells. “None of the three authors have a degree in archaeology, epigraphy, or anything to do with ancient writing. Their underlying subtext is, ‘We’re all so brilliant and we can’t decipher it so it can’t be writing.’ It’s ludicrous.” Wells compares fact-checking Farmer to fact-checking Donald Trump. “You have to fact-check every single thing he says because it’s mostly wrong.”

And Wells’ beef with Witzel goes all the way back to his PhD dissertation on the Indus script, which Witzel tried to block, according to Wells. Later, while escorting Witzel through India, Wells would show him a PowerPoint presentation entitled “Ten reasons you don’t know what you’re talking about” while in the back of a cab.

One thing Rao and Sproat do agree on is that if the Indus script turns out not to encode a language, that might end up being even more interesting. “We know a lot about ancient civilizations that had writing but we know a lot less about civilizations that lacked writing,” says Sproat. “And if this was some kind of general nonlinguistic system, in a sense, that would be much more interesting than if it was just some kind of script.”

Rao also thinks there were some nuances of his work that were lost in the debate. “It was an interesting intellectual debate with them and hopefully we’ve now reached a truce,” Rao says, laughing. “Hopefully it’s not going to be a continued lifelong debate, but I think we’ve done our best so far on either side. I’m definitely an optimist and I think we will have a much better understanding of the Indus script one way or the other, linguistic or not.”


Outside of this debate, decipherment progress is also threatened by modern-day politics. Within India, different factions are fighting over whose language and culture descended from the Indus Valley Civilization. There’s the Sanskrit region in the north, the Dravidian region in the south, and those speaking tribal languages in the middle. “They’re arguing that whoever is descended from the people who wrote the Indus script are the true inheritors of India,” says Wells. “So, they’re arguing about this from a modern political point of view. I know people who have received death threats for saying it’s not Sanskrit or saying it’s not Dravidian.” And because the Indus Valley Civilization spanned across present-day India and Pakistan, modern tensions between the two countries bleed into the Indus studies. The photographic collections of the Indus artifacts are published in two separate volumes — one for the artifacts found in India and another for those found in Pakistan.

Another challenge to the script’s decipherment is a classic one: money. Wells believes that until universities and funding agencies make a concerted effort to foster the study of the Indus script, little headway will be made. “It has to be a cooperative effort, it has to be funded, and it has to have a home,” says Wells. For his part in fostering a collaborative effort, Wells is hosting a second annual meeting on the Indus script to take place this March in British Columbia. And if nothing else, that $10,000 reward is on the table for as long as Farmer is alive.

We don’t have a decipherment yet, but Rao believes that until we find longer samples or a multilingual text, these statistical strategies are our best bet. And Wells says progress will hinge on cooperation. “I think all of the pieces to decipher the script are there,” he says. “Teamwork, interdisciplinary, multigenerational probably. The more we work on it the more progress we make.” Wells and his colleagues have made some progress and plan to present it at the meeting this March. Their findings and other work presented at the meeting should be available to the public in April, published as the Proceedings of the Second International Meeting on Indus Epigraphy. In the meantime, anyone working on the script is welcome to contribute to Wells’ collaborative website, which features all of the known symbols and various analytical tools.

When asked about Arrival and whether being able to decipher scripts might one day save the world, Rao laughs. “Well,” he says, “[it] depends on the situation.”
