{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# End-to-End\n", "\n", "Generate Malay (MS) text from English (EN) knowledge graph triples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This tutorial is available as an IPython notebook at [Malaya-Graph/example/kg-to-text](https://github.com/huseinzol05/Malaya-Graph/tree/master/example/kg-to-text).\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "This module was trained only on standard language structure, so it is not safe to use it on local language structure.\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "The input must be in English knowledge graph triple format.\n", " \n", "
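As a rough illustration of that format, the sketch below builds the nested list by hand: one inner list per graph, each entry a dictionary with `head`, `type` and `tail` keys, matching the triples shown in the outputs of this notebook. The `check_kgs` helper and the `REQUIRED_KEYS` name are hypothetical, written only for this example, and are not part of `malaya_graph`.

```python
from typing import Dict, List

# Keys each triple is expected to carry, following the
# {'head', 'type', 'tail'} format used in this notebook.
REQUIRED_KEYS = ('head', 'type', 'tail')


def check_kgs(kgs: List[List[Dict[str, str]]]) -> List[List[Dict[str, str]]]:
    # Hypothetical helper (an assumption for illustration only):
    # raise early if a triple lacks a key or holds a non-string value.
    for i, kg in enumerate(kgs):
        for j, triple in enumerate(kg):
            for key in REQUIRED_KEYS:
                if not isinstance(triple.get(key), str):
                    raise ValueError(
                        f'kgs[{i}][{j}] needs a string value for {key!r}'
                    )
    return kgs


# One inner list per graph; the values are copied from outputs in this notebook.
kgs = check_kgs([
    [
        {'head': 'Haji Mohammad Najib', 'type': 'position held',
         'tail': 'Prime Minister of Malaysia'},
        {'head': 'Pahang', 'type': 'country', 'tail': 'Malaysia'},
    ],
    [
        {'head': 'Pahang', 'type': 'shares border with', 'tail': 'Perak'},
    ],
])
```

A list shaped like `kgs` is what the `generate` method documented later in this notebook takes as its first argument.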
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# !pip3 install -U git+https://github.com/huseinzol05/malaya@5.0 --no-deps" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "os.environ['CUDA_VISIBLE_DEVICES'] = ''" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import logging\n", "\n", "logging.basicConfig(level=logging.INFO)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/husein/.local/lib/python3.8/site-packages/malaya/tokenizer.py:208: FutureWarning: Possible nested set at position 3372\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n", "/home/husein/.local/lib/python3.8/site-packages/malaya/tokenizer.py:208: FutureWarning: Possible nested set at position 3890\n", " self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.16 s, sys: 3.53 s, total: 6.69 s\n", "Wall time: 2.17 s\n" ] } ], "source": [ "%%time\n", "\n", "import malaya_graph" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List available HuggingFace model" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:malaya_graph.kg_to_text:tested on test set 02 part translated KELM, https://huggingface.co/datasets/mesolitica/translated-KELM\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
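<table>
<thead>
<tr><th></th><th>Size (MB)</th><th>BLEU</th><th>SacreBLEU Verbose</th><th>Suggested length</th></tr>
</thead>
<tbody>
<tr><th>mesolitica/finetune-ttkg-t5-tiny-standard-bahasa-cased</th><td>139</td><td>61.067843</td><td>86.1/68.4/55.8/45.9 (BP = 0.980 ratio = 0.980 ...</td><td>256</td></tr>
<tr><th>mesolitica/finetune-ttkg-t5-small-standard-bahasa-cased</th><td>242</td><td>61.559203</td><td>86.0/68.4/56.1/46.3 (BP = 0.984 ratio = 0.984 ...</td><td>256</td></tr>
<tr><th>mesolitica/finetune-ttkg-t5-base-standard-bahasa-cased</th><td>892</td><td>58.764876</td><td>84.5/65.8/53.0/43.1 (BP = 0.984 ratio = 0.985 ...</td><td>256</td></tr>
</tbody>
</table>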
\n", "
" ], "text/plain": [ " Size (MB) BLEU \\\n", "mesolitica/finetune-ttkg-t5-tiny-standard-bahas... 139 61.067843 \n", "mesolitica/finetune-ttkg-t5-small-standard-baha... 242 61.559203 \n", "mesolitica/finetune-ttkg-t5-base-standard-bahas... 892 58.764876 \n", "\n", " SacreBLEU Verbose \\\n", "mesolitica/finetune-ttkg-t5-tiny-standard-bahas... 86.1/68.4/55.8/45.9 (BP = 0.980 ratio = 0.980 ... \n", "mesolitica/finetune-ttkg-t5-small-standard-baha... 86.0/68.4/56.1/46.3 (BP = 0.984 ratio = 0.984 ... \n", "mesolitica/finetune-ttkg-t5-base-standard-bahas... 84.5/65.8/53.0/43.1 (BP = 0.984 ratio = 0.985 ... \n", "\n", " Suggested length \n", "mesolitica/finetune-ttkg-t5-tiny-standard-bahas... 256 \n", "mesolitica/finetune-ttkg-t5-small-standard-baha... 256 \n", "mesolitica/finetune-ttkg-t5-base-standard-bahas... 256 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "malaya_graph.kg_to_text.available_huggingface()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load HuggingFace model\n", "\n", "```python\n", "def huggingface(model: str = 'mesolitica/finetune-ttkg-t5-small-standard-bahasa-cased', **kwargs):\n", " \"\"\"\n", " Load HuggingFace model to knowledge graph to text.\n", "\n", " Parameters\n", " ----------\n", " model: str, optional (default='mesolitica/finetune-ttkg-t5-small-standard-bahasa-cased')\n", " Check available models at `malaya_graph.kg_to_text.available_huggingface()`.\n", "\n", " Returns\n", " -------\n", " result: malaya_graph.model.text_to_kg.KGtoText\n", " \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "ttkg = malaya_graph.text_to_kg.e2e.huggingface()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "model = malaya_graph.kg_to_text.huggingface()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "string1 = \"Yang Berhormat Dato Sri Haji 
Mohammad Najib bin Tun Haji Abdul Razak ialah ahli politik Malaysia dan merupakan bekas Perdana Menteri Malaysia ke-6 yang mana beliau menjawat jawatan dari 3 April 2009 hingga 9 Mei 2018. Beliau juga pernah berkhidmat sebagai bekas Menteri Kewangan dan merupakan Ahli Parlimen Pekan Pahang\"\n", "string2 = \"Pahang ialah negeri yang ketiga terbesar di Malaysia Terletak di lembangan Sungai Pahang yang amat luas negeri Pahang bersempadan dengan Kelantan di utara Perak Selangor serta Negeri Sembilan di barat Johor di selatan dan Terengganu dan Laut China Selatan di timur.\"" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'G': ,\n", " 'triple': [{'head': 'Haji Mohammad Najib',\n", " 'type': 'position held',\n", " 'tail': 'Prime Minister of Malaysia'},\n", " {'head': 'Pahang', 'type': 'country', 'tail': 'Malaysia'}],\n", " 'rebel': ' Haji Mohammad Najib Prime Minister of Malaysia position held Pahang Malaysia country'},\n", " {'G': ,\n", " 'triple': [{'head': 'Pahang', 'type': 'country', 'tail': 'Malaysia'},\n", " {'head': 'Pahang', 'type': 'shares border with', 'tail': 'Perak'},\n", " {'head': 'Pahang', 'type': 'shares border with', 'tail': 'Johor'},\n", " {'head': 'Pahang',\n", " 'type': 'located in the administrative territorial entity',\n", " 'tail': 'Malaysia'},\n", " {'head': 'Pahang',\n", " 'type': 'located in or next to body of water',\n", " 'tail': 'Pahang River'}],\n", " 'rebel': ' Pahang Malaysia country Perak shares border with Johor shares border with Malaysia located in the administrative territorial entity Pahang River located in or next to body of water'}]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r = ttkg.generate([string1, string2], do_sample=True, \n", " max_length=256, \n", " top_k=0, \n", " temperature=0.7)\n", "r" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'head': 
'Haji Mohammad Najib',\n", " 'type': 'position held',\n", " 'tail': 'Prime Minister of Malaysia'},\n", " {'head': 'Pahang', 'type': 'country', 'tail': 'Malaysia'}]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r[0]['triple']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Predict\n", "\n", "```python\n", "def generate(self, kgs: List[List[Dict]], **kwargs):\n", "    \"\"\"\n", "    Generate text from a list of knowledge graph dictionaries.\n", "\n", "    Parameters\n", "    ----------\n", "    kgs: List[List[Dict]]\n", "        list of lists of {'head', 'type', 'tail'} dictionaries\n", "    **kwargs: keyword arguments passed to the HuggingFace `generate` method.\n", "        Read more at https://huggingface.co/docs/transformers/main_classes/text_generation\n", "\n", "    Returns\n", "    -------\n", "    result: List[str]\n", "    \"\"\"\n", "```" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Beliau merupakan Perdana Menteri Malaysia, Haji Mohammad Najib yang baru dilantik, yang dipilih oleh Majlis Bandaraya Pahang sebagai wakil parti kepada Majlis Bandaraya Pahang dari 2010 hingga 2017.',\n", " 'Haji Mohammad Najib merupakan Perdana Menteri Malaysia ke-12, yang terakhir berkhidmat sebagai Menteri Dalam Negeri Pahang dari 19 Mei 2017 hingga 5 September 2019.',\n", " 'Haji Mohammad Najib merupakan Perdana Menteri Malaysia pada 2016 di bawah pentadbiran Sultan Hussein.',\n", " 'Pahang bersempadan dengan Pahang berhampiran Perak, Perak, Johor, Pahang dan Pahang Sungai Pahang.',\n", " 'Pahang bersempadan dengan Sungai Pahang di selatan dan utara dengan Perak di selatan, Johor di barat laut, dan Perak di barat.',\n", " 'Pahang bersempadan dengan negeri Pahang di barat, Selangor di barat, Perak di timur, dan Johor di barat, Sungai Pahang di barat laut dan Sungai Pahang di barat.']" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.generate([r_['triple'] for 
r_ in r], do_sample=True, \n", " max_length=100, \n", " top_k=50, \n", " top_p=0.95, \n", " num_return_sequences=3)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }