| Files | Description |
|---|---|
| process_wiki.py | Convert the Wikipedia XML dump to plain text |
| train_word2vec_model.py | Train the word2vec model on the pt-br Wikipedia text |
| WikipediaWord2Vec.ipynb | Sample notebook |

- Build the Docker image: `docker-compose build`
- Download the Wikipedia pt-br dump: `curl https://dumps.wikimedia.org/ptwiki/latest/ptwiki-latest-pages-articles.xml.bz2 --create-dirs -o data/ptwiki-latest-pages-articles.xml.bz2`
- Process the Wikipedia dump: `docker-compose run jupyter python src/process_wiki.py data/ptwiki-latest-pages-articles.xml.bz2 data/wiki.pt-br.text`
- Train the model: `docker-compose run jupyter python src/train_word2vec_model.py data/wiki.pt-br.text data/wiki.pt-br.word2vec.model`
- Run the notebook: `docker-compose up -d`
- Access the notebook: localhost:8888
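
The steps above could also be chained from a single script. The following is only a minimal sketch and not part of this repository: the names `pipeline_commands` and `run_pipeline` are hypothetical, and it assumes `docker-compose` and `curl` are on the PATH and that the script is started from the repository root.

```python
import subprocess
from pathlib import PurePosixPath

# Dump URL taken from the download step above.
DUMP_URL = (
    "https://dumps.wikimedia.org/ptwiki/latest/"
    "ptwiki-latest-pages-articles.xml.bz2"
)


def pipeline_commands(data_dir="data"):
    """Return the README steps as a list of argv lists (hypothetical helper)."""
    data = PurePosixPath(data_dir)
    dump = str(data / "ptwiki-latest-pages-articles.xml.bz2")
    text = str(data / "wiki.pt-br.text")
    model = str(data / "wiki.pt-br.word2vec.model")
    # Prefix shared by the two steps that run inside the jupyter service.
    in_container = ["docker-compose", "run", "jupyter", "python"]
    return [
        ["docker-compose", "build"],
        ["curl", DUMP_URL, "--create-dirs", "-o", dump],
        in_container + ["src/process_wiki.py", dump, text],
        in_container + ["src/train_word2vec_model.py", text, model],
        ["docker-compose", "up", "-d"],
    ]


def run_pipeline(data_dir="data"):
    """Run every step in order and stop at the first failing command."""
    for cmd in pipeline_commands(data_dir):
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline()` executes the five commands in sequence; `check=True` makes a failed download or processing step abort the run instead of training on incomplete data.
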

Reference: http://textminingonline.com/training-word2vec-model-on-english-wikipedia-by-gensim