Markov Chain name generation

stuff about computer science and programming
Post Reply
User avatar
dendiz
Site Admin
Posts: 114
Joined: Wed Oct 10, 2018 3:48 am

Markov Chain name generation

Post by dendiz » Wed Oct 10, 2018 10:58 pm

Generation of random planet names was a task I tackled recently, and these type of 'random name' generations are great candidates for Markov Chains. Markov chains are probabilistic structures where the next element in the chain depends on the previous elements. By training the model on a set on input data the process creates a chain with each link containing a set of the next probable elements. By starting at a random link in the chain and following until you reach an ending node it's possible to get similar names to the original set.

Here is an example of how this works: Say you have the training input

Code: Select all

[sally sells seashells]
we will partition the input into 2 letter groups (the order of the chain is 2) and record the next letters that follow each partition. $ means end of word.

Code: Select all

(sa) -> l
(al) -> l
(ly) -> $

(se) -> l
(el) -> l
(ll) -> s
(ls) -> $

(se) -> a
(ea) -> s
(as) -> h
(sh) -> e
(he) -> l
(el) -> l
(ll) -> s
(ls) -> $
now we combine the same keys

Code: Select all

(sa) -> (l)
(al) -> (l)
(ll) -> (y)
(ly) -> ($)
(se) -> (l,a)
(el) -> (l,l)
(ll) -> (s,s)
(ls) -> ($,$)
(ea) -> (s)
(as) -> (h)
(sh) -> (e)
(he) -> l
we have built our chain. To generate a new name we pick a random starting point and follow the chain by appending one of the characters in the values until we reach an end or have a long enough string. Let's say we picked ea to start. Here is a sample string that we could end up with

Code: Select all

(ea) + s -> eas
e(as) + h -> eash
ea(sh) + e -> eashe
eas(he) + l -> eashel
...
we would end up with eashells. If we had encountered (se) we would randomly choose between l and a. This small example does not product very original results but that's because our training data was limited. Given a large amount of data and a higher order we would end up with better results. Here is some clojure code to create a markov chain and generate names.

Code: Select all

;the list is truncated for readability
(def words ["Acamar","Achernar","Achird","Acrab","Akrab","Elakrab","Graffias",
            "Acrux","Acubens","Adhafera","Adhara","Ain","Aladfar","Alamak","Alathfar","Alaraph","Albaldah","Albali" ...])
			
;;append ? to the end of each word so we have an easy way of knowing that a word has ended			
(def samples (map  #(str % "?") words))


;;partition the words. we will use the first 2 characters as the key and the last as the value.
;;((\s \a \l) (\a \l \l) (\l \l \y) (\s \e \l) (\e \l \l) (\l \l \s) (\s \e \a) (\e \a \s) (\a \s \h) (\s \h \e) (\h \e \l) (\e \l \l))
(def orders (mapcat #(partition 3 1 %) samples))

;;convert into a list of maps with the keys and values set
;;({"sa" \l} {"al" \l} {"ll" \y} {"se" \l} {"el" \l} {"ll" \s} {"se" \a} {"ea" \s} {"as" \h} {"sh" \e} {"he" \l} {"el" \l})
(def gr (map (fn [[f s t :as o]] {(str f s) t}) orders))

;;convert maps to 2-vectors so we can group them
;;(["sa" \l] ["al" \l] ["ll" \y] ["se" \l] ["el" \l] ["ll" \s] ["se" \a] ["ea" \s] ["as" \h] ["sh" \e] ["he" \l] ["el" \l])
(def mp (map first gr))

;;group by the keys so that we have a list of possible characters for each key
;;{"al" [["al" \l]], "ll" [["ll" \y] ["ll" \s]], "el" [["el" \l] ["el" \l]], "ea" [["ea" \s]], "sa" [["sa" \l]], "se" [["se" \l] ["se" \a]],
;; "sh" [["sh" \e]], "as" [["as" \h]], "he" [["he" \l]]}
(def gb (group-by first mp))

;;clean up the group-by so that only the following characters remain
;;{"ls" (\? \?), "al" (\l), "ll" (\y \s \s), "el" (\l \l), "ly" (\?), "ea" (\s), "sa" (\l), "se" (\l \a), "sh" (\e), "as" (\h), "he" (\l)}
(def markov-map (reduce-kv #(assoc %1 %2 (map second %3)) {} gb))

(defn generate-name
  "create a new name from the markov chain"
  []
  (loop [word ""
    part (subs (rand-nth words) 0 2)]
    (if (= part \?)
      word
      (let [new-word (str word part)
            new-part (rand-nth (markov-map (apply str (take-last 2 new-word))))]
	       (recur new-word new-part)))))
the generation function works like this

Code: Select all

1. our word is empty and the next part is the first 2 letters from a random word in our training data
2. if our part is the end of a word stop and return the word so far
3. other wise append the part to the word, select the a random value from the markov map with the key as the last 2 letters of our word
4. recurse with the new word and new part
Here are some of the names that the generator came up with

Code: Select all

(take 25 (repeatedly generate-name))

("Sape" "Kal" "Torus" "Mei" "Taulurushaukbu" "DuQue" "Cangforishynoriah" "Pavlius" "Gachine" "Arra" "Midea" "Avent" "Alcyoni" "Hava'eloremaias" "Rembus" "Scelsiascidia" "Fimsor" "Drenglossichiropus" "Ohnis" "Nex" "Bram" "Minn" "Altaramelbalia" "Gia" "Propintlah")
here s a link to the full name list.

Playing with the order affects the quality of the results. For this example and order 3 was over fitting the data - producing mostly names identical to the ones in the training data. Order 2 seems like the best fit.

Post Reply