Exciting updates to the Wikipedia Monthly dataset for November! 🚀
・ Fixed a bug so that infobox leftovers and other wiki markers such as __TOC__ are now removed.
・ New Python package https://pypi.org/project/wikisets: a dataset builder with efficient sampling, so you can seamlessly combine the languages you want for any date (ideal for pretraining data, but works for any purpose).
・ Moved the pipeline to a large server. Costs are much higher, but reliability and predictability are better (let me know if you'd like to sponsor this!).
・ Dataset sizes are unfortunately missing this month due to shenanigans with the migration, but they should be back in December's update.
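For the curious, here is a rough idea of what stripping markers like __TOC__ can look like. This is only a minimal sketch, not the pipeline's actual code: it assumes the markers are MediaWiki behavior switches (uppercase words wrapped in double underscores), and the function name `strip_behavior_switches` is made up for illustration.

```python
import re

# Assumption: markers are MediaWiki "behavior switches", i.e. uppercase
# letters wrapped in double underscores (__TOC__, __NOTOC__, ...).
BEHAVIOR_SWITCH = re.compile(r"__[A-Z]+__")


def strip_behavior_switches(text: str) -> str:
    """Remove markers such as __TOC__ and tidy the whitespace left behind."""
    cleaned = BEHAVIOR_SWITCH.sub("", text)
    # Collapse runs of spaces or tabs created by the removal.
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned)
    # Drop trailing whitespace at the end of each line.
    cleaned = re.sub(r"[ \t]+\n", "\n", cleaned)
    return cleaned.strip()


if __name__ == "__main__":
    print(strip_behavior_switches("__NOTOC__\nParis is the capital __TOC__ of France."))
```

Lowercase double-underscore names (for example `__init__` in an article about Python) are left alone because the pattern only matches uppercase letters.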
For anyone who likes to get their papers summarized: I've built a first version of a bentocard summarizer for research papers. It gives you an overview of a paper, with optional deep dives into the parts you're interested in. It still needs a lot of improvement, so do check out the space and share your feedback!