Word2vec Quick Tutorial using the Default Implementation in C
Word2Vec is a novel way to create vector representations of words that preserve their meaning, i.e. words with similar meanings tend to lie close to each other in the vector space.
The first implementation was developed by some folks working at Google.
Here is a very simple guide to downloading and installing word2vec on a Linux box and using it for basic operations.
Check out the project using SVN
~$ svn checkout http://word2vec.googlecode.com/svn/trunk/ word2vec
Build the project using make
Enter the directory created by svn and run make:
~/word2vec$ make
Run the included shell file demo-word.sh
This will download some data for you (about 100 MB of Wikipedia article text, compressed to roughly 30 MB) and train the model on it.
This can take some time (about 10 minutes for me) because it's doing a lot of processing, although it will try to run in parallel if your machine supports multiple threads.
This will create an output file called vectors.bin (the words in vector format).
~/word2vec$ ./demo-word.sh
Once the model has been trained (for example, by running demo-word.sh as described in the last step), you'll notice that a new file has been created, namely vectors.bin. You can then start using the project proper.
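If you want to peek inside vectors.bin from your own code, the sketch below may help. It assumes the file is in the binary layout written by the C tool: a text header line with the vocabulary size and the vector dimension, then for each word the word itself, a space, the raw 32-bit floats, and a newline. It also assumes little-endian floats, as on a typical x86 Linux box. The names read_vectors and write_demo_file and the two-word demo file are made up for illustration; to try it on real data you would point read_vectors at your own vectors.bin.

```python
import os
import struct
import tempfile


def read_vectors(path):
    """Read a word2vec-style binary file into a dict of word -> list of floats.

    Assumed layout: "<vocab_size> <dim>\n", then per word:
    "<word> " + dim little-endian float32 values + "\n".
    """
    vectors = {}
    with open(path, "rb") as f:
        vocab_size, dim = (int(x) for x in f.readline().split())
        for _ in range(vocab_size):
            chars = []
            while True:
                ch = f.read(1)
                if ch == b"" or ch == b" ":
                    break
                if ch != b"\n":  # skip the newline left over from the previous record
                    chars.append(ch)
            word = b"".join(chars).decode("utf-8", errors="replace")
            data = f.read(4 * dim)
            vectors[word] = list(struct.unpack("<%df" % dim, data))
    return vectors


def write_demo_file(path):
    """Write a tiny two-word file in the assumed layout (hypothetical data)."""
    with open(path, "wb") as f:
        f.write(b"2 3\n")
        for word, vec in ((b"cat", (1.0, 0.5, 0.0)), (b"dog", (0.5, 1.0, 0.25))):
            f.write(word + b" " + struct.pack("<3f", *vec) + b"\n")


# Round-trip the demo file so the sketch runs without a trained model.
demo_path = os.path.join(tempfile.mkdtemp(), "demo-vectors.bin")
write_demo_file(demo_path)
demo_vectors = read_vectors(demo_path)
print(demo_vectors["cat"])
```

Loading everything into a plain dict keeps the sketch short; for a full-size model you would probably want a NumPy array instead.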
Calculating distances between word pairs
~/word2vec$ ./distance vectors.bin
Example output (the words closest to "cat"; despite the "Cosine distance" header, higher scores mean more similar words)
Enter word or sentence (EXIT to break): cat
Word: cat Position in vocabulary: 2601
Word Cosine distance
------------------------------------------------------------------------
meow 0.621209
cats 0.568651
feline 0.550209
caracal 0.542168
dog 0.538465
kitten 0.535119
purebred 0.529065
felis 0.508320
eared 0.503065
bobcat 0.499513
tapir 0.493953
tabby 0.487100
oncifelis 0.482763
longhair 0.476867
lica 0.462677
(content suppressed)
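The scores in the right-hand column behave like cosine similarity between word vectors: the closer to 1, the more similar the words. Here is a minimal sketch of that calculation, using made-up three-dimensional vectors rather than real ones from vectors.bin. The names toy_vectors and nearest are illustrative only and are not part of word2vec.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, vectors, top_n=3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical toy vectors, chosen by hand for illustration.
toy_vectors = {
    "cat": [1.0, 0.0, 0.0],
    "kitten": [0.9, 0.1, 0.0],
    "dog": [0.6, 0.8, 0.0],
    "car": [0.0, 0.0, 1.0],
}

for other, score in nearest("cat", toy_vectors):
    print("%-10s %.6f" % (other, score))
```

With these toy values "kitten" ranks above "dog", and "car" comes last because its vector is orthogonal to the one for "cat".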
Finding analogies between words
E.g. "king" is to "man" as "queen" is to ... ?
~/word2vec$ ./word-analogy vectors.bin
Sample output for "king", "man" and "queen":
Enter three words (EXIT to break): king man queen
Word: king Position in vocabulary: 187
Word: man Position in vocabulary: 243
Word: queen Position in vocabulary: 903
Word Distance
------------------------------------------------------------------------
woman 0.559819
girl 0.465427
loving 0.460078
wonder 0.434169
maid 0.426818
beautiful 0.422880
lovely 0.418539
thighs 0.413182
gimme 0.410801
love 0.405989
gentleman 0.402662
mister 0.402502
loner 0.392148
(content suppressed)
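One way to picture what the analogy tool does: for the input words A, B and C it builds the target vector B - A + C and then lists the words closest to that target, leaving out the three input words. The sketch below follows that idea with hand-made two-dimensional vectors. The vector values and the function name analogy are assumptions for illustration, not output from a trained model.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def analogy(a, b, c, vectors, top_n=1):
    """Return the words closest to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    unit = {word: normalize(vec) for word, vec in vectors.items()}
    target = normalize([y - x + z for x, y, z in zip(unit[a], unit[b], unit[c])])
    scores = [
        (word, sum(p * q for p, q in zip(target, vec)))
        for word, vec in unit.items()
        if word not in (a, b, c)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical toy vectors: first axis ~ "royalty", second axis ~ "gender".
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "girl": [0.1, -0.9],
    "apple": [0.2, 0.1],
}

for word, score in analogy("king", "man", "queen", toy_vectors, top_n=3):
    print("%-10s %.6f" % (word, score))
```

In this toy space "woman" comes out on top, matching the first row of the sample output above; real vectors are noisier, which is why words like "loving" and "wonder" also show up there.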