From 2c0f19b828567cad1aa1455ec79180e8bcaab07f Mon Sep 17 00:00:00 2001 From: Lekton ZHANG <869944+takato3000@users.noreply.github.com> Date: Wed, 4 Dec 2024 06:15:40 +0800 Subject: [PATCH 1/2] update a method to process the dir path in windows and update filehash of aclImdb dataset --- .../tutorials/keras/text_classification.ipynb | 1945 +++++++++-------- 1 file changed, 973 insertions(+), 972 deletions(-) diff --git a/site/zh-cn/tutorials/keras/text_classification.ipynb b/site/zh-cn/tutorials/keras/text_classification.ipynb index a9beeea6ec..8efe5de600 100644 --- a/site/zh-cn/tutorials/keras/text_classification.ipynb +++ b/site/zh-cn/tutorials/keras/text_classification.ipynb @@ -1,974 +1,975 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "Ic4_occAAiAT" - }, - "source": [ - "##### Copyright 2019 The TensorFlow Authors." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "cellView": "form", - "id": "ioaprt5q5US7" - }, - "outputs": [], - "source": [ - "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", - "# you may not use this file except in compliance with the License.\n", - "# You may obtain a copy of the License at\n", - "#\n", - "# https://www.apache.org/licenses/LICENSE-2.0\n", - "#\n", - "# Unless required by applicable law or agreed to in writing, software\n", - "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", - "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", - "# See the License for the specific language governing permissions and\n", - "# limitations under the License." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "cellView": "form", - "id": "yCl0eTNH5RS3" - }, - "outputs": [], - "source": [ - "#@title MIT License\n", - "#\n", - "# Copyright (c) 2017 François Chollet\n", - "#\n", - "# Permission is hereby granted, free of charge, to any person obtaining a\n", - "# copy of this software and associated documentation files (the \"Software\"),\n", - "# to deal in the Software without restriction, including without limitation\n", - "# the rights to use, copy, modify, merge, publish, distribute, sublicense,\n", - "# and/or sell copies of the Software, and to permit persons to whom the\n", - "# Software is furnished to do so, subject to the following conditions:\n", - "#\n", - "# The above copyright notice and this permission notice shall be included in\n", - "# all copies or substantial portions of the Software.\n", - "#\n", - "# THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n", - "# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n", - "# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL\n", - "# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n", - "# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING\n", - "# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER\n", - "# DEALINGS IN THE SOFTWARE." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ItXfxkxvosLH" - }, - "source": [ - "# 电影评论文本分类" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "hKY4XMc9o8iB" - }, - "source": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
在 TensorFlow.org 上查看 在 Google Colab 中运行 在 GitHub 上查看源代码 下载笔记本
" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Eg62Pmz3o83v" - }, - "source": [ - "本教程演示了从存储在磁盘上的纯文本文件开始的文本分类。您将训练一个二元分类器对 IMDB 数据集执行情感分析。在笔记本的最后,有一个练习供您尝试,您将在其中训练一个多类分类器来预测 Stack Overflow 上编程问题的标签。\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "8RZOuS9LWQvv" - }, - "outputs": [], - "source": [ - "import matplotlib.pyplot as plt\n", - "import os\n", - "import re\n", - "import shutil\n", - "import string\n", - "import tensorflow as tf\n", - "\n", - "from tensorflow.keras import layers\n", - "from tensorflow.keras import losses\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "6-tTFS04dChr" - }, - "outputs": [], - "source": [ - "print(tf.__version__)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NBTI1bi8qdFV" - }, - "source": [ - "## 情感分析\n", - "\n", - "此笔记本训练了一个情感分析模型,利用评论文本将电影评论分类为*正面*或*负面*评价。这是一个*二元*(或二类)分类示例,也是一个重要且应用广泛的机器学习问题。\n", - "\n", - "您将使用 [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/),其中包含 [Internet Movie Database](https://www.imdb.com/) 中的 50,000 条电影评论文本 。我们将这些评论分为两组,其中 25,000 条用于训练,另外 25,000 条用于测试。训练集和测试集是*均衡的*,也就是说其中包含相等数量的正面评价和负面评价。\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "iAsKG535pHep" - }, - "source": [ - "### 下载并探索 IMDB 数据集\n", - "\n", - "我们下载并提取数据集,然后浏览一下目录结构。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "k7ZYnuajVlFN" - }, - "outputs": [], - "source": [ - "url = \"https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\"\n", - "\n", - "dataset = tf.keras.utils.get_file(\"aclImdb_v1\", url,\n", - " untar=True, cache_dir='.',\n", - " cache_subdir='')\n", - "\n", - "dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "355CfOvsV1pl" - }, - "outputs": [], - "source": [ - "os.listdir(dataset_dir)" - ] - }, - { - "cell_type": 
"code", - "execution_count": null, - "metadata": { - "id": "7ASND15oXpF1" - }, - "outputs": [], - "source": [ - "train_dir = os.path.join(dataset_dir, 'train')\n", - "os.listdir(train_dir)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ysMNMI1CWDFD" - }, - "source": [ - "`aclImdb/train/pos` 和 `aclImdb/train/neg` 目录包含许多文本文件,每个文件都是一条电影评论。我们来看看其中的一条评论。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "R7g8hFvzWLIZ" - }, - "outputs": [], - "source": [ - "sample_file = os.path.join(train_dir, 'pos/1181_9.txt')\n", - "with open(sample_file) as f:\n", - " print(f.read())" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Mk20TEm6ZRFP" - }, - "source": [ - "### 加载数据集\n", - "\n", - "接下来,您将从磁盘加载数据并将其准备为适合训练的格式。为此,您将使用有用的 [text_dataset_from_directory](https://tensorflow.google.cn/api_docs/python/tf/keras/preprocessing/text_dataset_from_directory) 实用工具,它期望的目录结构如下所示。\n", - "\n", - "```\n", - "main_directory/\n", - "...class_a/\n", - "......a_text_1.txt\n", - "......a_text_2.txt\n", - "...class_b/\n", - "......b_text_1.txt\n", - "......b_text_2.txt\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "nQauv38Lnok3" - }, - "source": [ - "要准备用于二元分类的数据集,磁盘上需要有两个文件夹,分别对应于 `class_a` 和 `class_b`。这些将是正面和负面的电影评论,可以在 `aclImdb/train/pos` 和 `aclImdb/train/neg` 中找到。由于 IMDB 数据集包含其他文件夹,因此您需要在使用此实用工具之前将其移除。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "VhejsClzaWfl" - }, - "outputs": [], - "source": [ - "remove_dir = os.path.join(train_dir, 'unsup')\n", - "shutil.rmtree(remove_dir)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "95kkUdRoaeMw" - }, - "source": [ - "接下来,您将使用 `text_dataset_from_directory` 实用工具创建带标签的 `tf.data.Dataset`。[tf.data](https://tensorflow.google.cn/guide/data) 是一组强大的数据处理工具。\n", - "\n", - 
"运行机器学习实验时,最佳做法是将数据集拆成三份:[训练](https://developers.google.com/machine-learning/glossary#training_set)、[验证](https://developers.google.com/machine-learning/glossary#validation_set) 和 [测试](https://developers.google.com/machine-learning/glossary#test-set)。\n", - "\n", - "IMDB 数据集已经分成训练集和测试集,但缺少验证集。我们来通过下面的 `validation_split` 参数,使用 80:20 拆分训练数据来创建验证集。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "nOrK-MTYaw3C" - }, - "outputs": [], - "source": [ - "batch_size = 32\n", - "seed = 42\n", - "\n", - "raw_train_ds = tf.keras.utils.text_dataset_from_directory(\n", - " 'aclImdb/train', \n", - " batch_size=batch_size, \n", - " validation_split=0.2, \n", - " subset='training', \n", - " seed=seed)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "5Y33oxOUpYkh" - }, - "source": [ - "如上所示,训练文件夹中有 25,000 个样本,您将使用其中的 80%(或 20,000 个)进行训练。稍后您将看到,您可以通过将数据集直接传递给 `model.fit` 来训练模型。如果您不熟悉 `tf.data`,还可以遍历数据集并打印出一些样本,如下所示。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "51wNaPPApk1K" - }, - "outputs": [], - "source": [ - "for text_batch, label_batch in raw_train_ds.take(1):\n", - " for i in range(3):\n", - " print(\"Review\", text_batch.numpy()[i])\n", - " print(\"Label\", label_batch.numpy()[i])" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "JWq1SUIrp1a-" - }, - "source": [ - "请注意,评论包含原始文本(带有标点符号和偶尔出现的 HTML 代码,如 `
`)。我们将在以下部分展示如何处理这些问题。\n", - "\n", - "标签为 0 或 1。要查看它们与正面和负面电影评论的对应关系,可以查看数据集上的 `class_names` 属性。\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "MlICTG8spyO2" - }, - "outputs": [], - "source": [ - "print(\"Label 0 corresponds to\", raw_train_ds.class_names[0])\n", - "print(\"Label 1 corresponds to\", raw_train_ds.class_names[1])" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pbdO39vYqdJr" - }, - "source": [ - "接下来,您将创建验证数据集和测试数据集。您将使用训练集中剩余的 5,000 条评论进行验证。" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "SzxazN8Hq1pF" - }, - "source": [ - "注:使用 `validation_split` 和 `subset` 参数时,请确保要么指定随机种子,要么传递 `shuffle=False`,这样验证拆分和训练拆分就不会重叠。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "JsMwwhOoqjKF" - }, - "outputs": [], - "source": [ - "raw_val_ds = tf.keras.utils.text_dataset_from_directory(\n", - " 'aclImdb/train', \n", - " batch_size=batch_size, \n", - " validation_split=0.2, \n", - " subset='validation', \n", - " seed=seed)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "rdSr0Nt3q_ns" - }, - "outputs": [], - "source": [ - "raw_test_ds = tf.keras.utils.text_dataset_from_directory(\n", - " 'aclImdb/test', \n", - " batch_size=batch_size)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qJmTiO0IYAjm" - }, - "source": [ - "### 准备用于训练的数据集\n", - "\n", - "接下来,您将使用有用的 `tf.keras.layers.TextVectorization` 层对数据进行标准化、词例化和向量化。\n", - "\n", - "标准化是指对文本进行预处理,通常是移除标点符号或 HTML 元素以简化数据集。词例化是指将字符串分割成词例(例如,通过空格将句子分割成单个单词)。向量化是指将词例转换为数字,以便将它们输入神经网络。所有这些任务都可以通过这个层完成。\n", - "\n", - "正如您在上面看到的,评论包含各种 HTML 代码,例如 `
`。`TextVectorization` 层(默认情况下会将文本转换为小写并去除标点符号,但不会去除 HTML)中的默认标准化程序不会移除这些代码。您将编写一个自定义标准化函数来移除 HTML。" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ZVcHl-SLrH-u" - }, - "source": [ - "注:为了防止[训练-测试偏差](https://developers.google.com/machine-learning/guides/rules-of-ml#training-serving_skew)(也称为训练-应用偏差),在训练和测试时间对数据进行相同的预处理非常重要。为此,可以将 `TextVectorization` 层直接包含在模型中,如本教程后面所示。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "SDRI_s_tX1Hk" - }, - "outputs": [], - "source": [ - "def custom_standardization(input_data):\n", - " lowercase = tf.strings.lower(input_data)\n", - " stripped_html = tf.strings.regex_replace(lowercase, '
<br />', ' ')\n", - " return tf.strings.regex_replace(stripped_html,\n", - " '[%s]' % re.escape(string.punctuation),\n", - " '')" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "d2d3Aw8dsUux" - }, - "source": [ - "
接下来,您将创建一个 `TextVectorization` 层。您将使用该层对我们的数据进行标准化、词例化和向量化。您将 `output_mode` 设置为 `int` 以便为每个词例创建唯一的整数索引。\n", - "\n", - "请注意,您使用的是默认拆分函数,以及您在上面定义的自定义标准化函数。您还将为模型定义一些常量,例如显式的最大 `sequence_length`,这会使层将序列填充或截断为精确的 `sequence_length` 值。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "-c76RvSzsMnX" - }, - "outputs": [], - "source": [ - "max_features = 10000\n", - "sequence_length = 250\n", - "\n", - "vectorize_layer = layers.TextVectorization(\n", - " standardize=custom_standardization,\n", - " max_tokens=max_features,\n", - " output_mode='int',\n", - " output_sequence_length=sequence_length)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "vlFOpfF6scT6" - }, - "source": [ - "接下来,您将调用 `adapt` 以使预处理层的状态适合数据集。这会使模型构建字符串到整数的索引。" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "lAhdjK7AtroA" - }, - "source": [ - "注:在调用时请务必仅使用您的训练数据(使用测试集会泄漏信息)。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "GH4_2ZGJsa_X" - }, - "outputs": [], - "source": [ - "# Make a text-only dataset (without labels), then call adapt\n", - "train_text = raw_train_ds.map(lambda x, y: x)\n", - "vectorize_layer.adapt(train_text)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "SHQVEFzNt-K_" - }, - "source": [ - "我们来创建一个函数来查看使用该层预处理一些数据的结果。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "SCIg_T50wOCU" - }, - "outputs": [], - "source": [ - "def vectorize_text(text, label):\n", - " text = tf.expand_dims(text, -1)\n", - " return vectorize_layer(text), label" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "XULcm6B3xQIO" - }, - "outputs": [], - "source": [ - "# retrieve a batch (of 32 reviews and labels) from the dataset\n", - "text_batch, label_batch = next(iter(raw_train_ds))\n", - "first_review, first_label = text_batch[0], label_batch[0]\n", - "print(\"Review\", first_review)\n", - "print(\"Label\", 
raw_train_ds.class_names[first_label])\n", - "print(\"Vectorized review\", vectorize_text(first_review, first_label))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6u5EX0hxyNZT" - }, - "source": [ - "正如您在上面看到的,每个词例都被一个整数替换了。您可以通过在该层上调用 `.get_vocabulary()` 来查找每个整数对应的词例(字符串)。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "kRq9hTQzhVhW" - }, - "outputs": [], - "source": [ - "print(\"1287 ---> \",vectorize_layer.get_vocabulary()[1287])\n", - "print(\" 313 ---> \",vectorize_layer.get_vocabulary()[313])\n", - "print('Vocabulary size: {}'.format(len(vectorize_layer.get_vocabulary())))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "XD2H6utRydGv" - }, - "source": [ - "你几乎已经准备好训练你的模型了。作为最后的预处理步骤,你将在训练、验证和测试数据集上应用之前创建的TextVectorization层。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "2zhmpeViI1iG" - }, - "outputs": [], - "source": [ - "train_ds = raw_train_ds.map(vectorize_text)\n", - "val_ds = raw_val_ds.map(vectorize_text)\n", - "test_ds = raw_test_ds.map(vectorize_text)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "YsVQyPMizjuO" - }, - "source": [ - "### 配置数据集以提高性能\n", - "\n", - "以下是加载数据时应该使用的两种重要方法,以确保 I/O 不会阻塞。\n", - "\n", - "从磁盘加载后,`.cache()` 会将数据保存在内存中。这将确保数据集在训练模型时不会成为瓶颈。如果您的数据集太大而无法放入内存,也可以使用此方法创建高性能的磁盘缓存,这比许多小文件的读取效率更高。\n", - "\n", - "`prefetch()` 会在训练时将数据预处理和模型执行重叠。\n", - "\n", - "您可以在[数据性能指南](https://tensorflow.google.cn/guide/data_performance)中深入了解这两种方法,以及如何将数据缓存到磁盘。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "wMcs_H7izm5m" - }, - "outputs": [], - "source": [ - "AUTOTUNE = tf.data.AUTOTUNE\n", - "\n", - "train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)\n", - "val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)\n", - "test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "LLC02j2g-llC" - }, - 
"source": [ - "### 创建模型\n", - "\n", - "是时候创建您的神经网络了:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "dkQP6in8yUBR" - }, - "outputs": [], - "source": [ - "embedding_dim = 16" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "xpKOoWgu-llD" - }, - "outputs": [], - "source": [ - "model = tf.keras.Sequential([\n", - " layers.Embedding(max_features + 1, embedding_dim),\n", - " layers.Dropout(0.2),\n", - " layers.GlobalAveragePooling1D(),\n", - " layers.Dropout(0.2),\n", - " layers.Dense(1)])\n", - "\n", - "model.summary()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6PbKQ6mucuKL" - }, - "source": [ - "层按顺序堆叠以构建分类器:\n", - "\n", - "1. 第一个层是 `Embedding` 层。此层采用整数编码的评论,并查找每个单词索引的嵌入向量。这些向量是通过模型训练学习到的。向量向输出数组增加了一个维度。得到的维度为:`(batch, sequence, embedding)`。要详细了解嵌入向量,请参阅[单词嵌入向量](https://tensorflow.google.cn/text/guide/word_embeddings)教程。\n", - "2. 接下来,`GlobalAveragePooling1D` 将通过对序列维度求平均值来为每个样本返回一个定长输出向量。这允许模型以尽可能最简单的方式处理变长输入。\n", - "3. 
最后一层与单个输出结点密集连接。" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "L4EqVWg4-llM" - }, - "source": [ - "### 损失函数与优化器\n", - "\n", - "模型训练需要一个损失函数和一个优化器。由于这是一个二元分类问题,并且模型输出概率(具有 Sigmoid 激活的单一单元层),我们将使用 `losses.BinaryCrossentropy` 损失函数。\n", - "\n", - "现在,配置模型以使用优化器和损失函数:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Mr0GP-cQ-llN" - }, - "outputs": [], - "source": [ - "model.compile(loss=losses.BinaryCrossentropy(from_logits=True),\n", - " optimizer='adam',\n", - " metrics=tf.metrics.BinaryAccuracy(threshold=0.0))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "35jv_fzP-llU" - }, - "source": [ - "### 训练模型\n", - "\n", - "将 `dataset` 对象传递给 fit 方法,对模型进行训练。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "tXSGrjWZ-llW" - }, - "outputs": [], - "source": [ - "epochs = 10\n", - "history = model.fit(\n", - " train_ds,\n", - " validation_data=val_ds,\n", - " epochs=epochs)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "9EEGuDVuzb5r" - }, - "source": [ - "### 评估模型\n", - "\n", - "我们来看一下模型的性能如何。将返回两个值。损失值(loss)(一个表示误差的数字,值越低越好)与准确率(accuracy)。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "zOMKywn4zReN" - }, - "outputs": [], - "source": [ - "loss, accuracy = model.evaluate(test_ds)\n", - "\n", - "print(\"Loss: \", loss)\n", - "print(\"Accuracy: \", accuracy)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "z1iEXVTR0Z2t" - }, - "source": [ - "这种十分简单的方式实现了约 86% 的准确率。" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ldbQqCw2Xc1W" - }, - "source": [ - "### 创建准确率和损失随时间变化的图表\n", - "\n", - "`model.fit()` 会返回包含一个字典的 `History` 对象。该字典包含训练过程中产生的所有信息:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "-YcvZsdvWfDf" - }, - "outputs": [], - "source": [ - "history_dict = history.history\n", - "history_dict.keys()" - ] - }, - { - "cell_type": 
"markdown", - "metadata": { - "id": "1_CH32qJXruI" - }, - "source": [ - "其中有四个条目:每个条目代表训练和验证过程中的一项监测指标。您可以使用这些指标来绘制用于比较的训练损失和验证损失图表,以及训练准确率和验证准确率图表:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "2SEMeQ5YXs8z" - }, - "outputs": [], - "source": [ - "acc = history_dict['binary_accuracy']\n", - "val_acc = history_dict['val_binary_accuracy']\n", - "loss = history_dict['loss']\n", - "val_loss = history_dict['val_loss']\n", - "\n", - "epochs = range(1, len(acc) + 1)\n", - "\n", - "# \"bo\" is for \"blue dot\"\n", - "plt.plot(epochs, loss, 'bo', label='Training loss')\n", - "# b is for \"solid blue line\"\n", - "plt.plot(epochs, val_loss, 'b', label='Validation loss')\n", - "plt.title('Training and validation loss')\n", - "plt.xlabel('Epochs')\n", - "plt.ylabel('Loss')\n", - "plt.legend()\n", - "\n", - "plt.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Z3PJemLPXwz_" - }, - "outputs": [], - "source": [ - "plt.plot(epochs, acc, 'bo', label='Training acc')\n", - "plt.plot(epochs, val_acc, 'b', label='Validation acc')\n", - "plt.title('Training and validation accuracy')\n", - "plt.xlabel('Epochs')\n", - "plt.ylabel('Accuracy')\n", - "plt.legend(loc='lower right')\n", - "\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "hFFyCuJoXy7r" - }, - "source": [ - "在该图表中,虚线代表训练损失和准确率,实线代表验证损失和准确率。\n", - "\n", - "请注意,训练损失会逐周期*下降*,而训练准确率则逐周期*上升*。使用梯度下降优化时,这是预期结果,它应该在每次迭代中最大限度减少所需的数量。\n", - "\n", - "但是,对于验证损失和准确率来说则不然——它们似乎会在训练转确率之前达到顶点。这是过拟合的一个例子:模型在训练数据上的表现要好于在之前从未见过的数据上的表现。经过这一点之后,模型会过度优化和学习*特定*于训练数据的表示,但无法*泛化*到测试数据。\n", - "\n", - "对于这种特殊情况,您可以通过在验证准确率不再增加时直接停止训练来防止过度拟合。一种方式是使用 `tf.keras.callbacks.EarlyStopping` 回调。" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "-to23J3Vy5d3" - }, - "source": [ - "## 导出模型\n", - "\n", - "在上面的代码中,您在向模型馈送文本之前对数据集应用了 `TextVectorization`。 如果您想让模型能够处理原始字符串(例如,为了简化部署),您可以在模型中包含 `TextVectorization` 
层。为此,您可以使用刚刚训练的权重创建一个新模型。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "FWXsMvryuZuq" - }, - "outputs": [], - "source": [ - "export_model = tf.keras.Sequential([\n", - " vectorize_layer,\n", - " model,\n", - " layers.Activation('sigmoid')\n", - "])\n", - "\n", - "export_model.compile(\n", - " loss=losses.BinaryCrossentropy(from_logits=False), optimizer=\"adam\", metrics=['accuracy']\n", - ")\n", - "\n", - "# Test it with `raw_test_ds`, which yields raw strings\n", - "loss, accuracy = export_model.evaluate(raw_test_ds)\n", - "print(accuracy)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "TwQgoN88LoEF" - }, - "source": [ - "### 使用新数据进行推断\n", - "\n", - "要获得对新样本的预测,只需调用 `model.predict()` 即可。" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "QW355HH5L49K" - }, - "outputs": [], - "source": [ - "examples = [\n", - " \"The movie was great!\",\n", - " \"The movie was okay.\",\n", - " \"The movie was terrible...\"\n", - "]\n", - "\n", - "export_model.predict(examples)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "MaxlpFWpzR6c" - }, - "source": [ - "将文本预处理逻辑包含在模型中后,您可以导出用于生产的模型,从而简化部署并降低[训练/测试偏差](https://developers.google.com/machine-learning/guides/rules-of-ml#training-serving_skew)的可能性。\n", - "\n", - "在选择应用 TextVectorization 层的位置时,需要注意性能差异。在模型之外使用它可以让您在 GPU 上训练时进行异步 CPU 处理和数据缓冲。因此,如果您在 GPU 上训练模型,您应该在开发模型时使用此选项以获得最佳性能,然后在准备好部署时进行切换,在模型中包含 TextVectorization 层。\n", - "\n", - "请参阅此[教程](https://tensorflow.google.cn/tutorials/keras/save_and_load),详细了解如何保存模型。" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "eSSuci_6nCEG" - }, - "source": [ - "## 练习:对 Stack Overflow 问题进行多类分类\n", - "\n", - "本教程展示了如何在 IMDB 数据集上从头开始训练二元分类器。作为练习,您可以修改此笔记本以训练多类分类器来预测 [Stack Overflow](http://stackoverflow.com/) 上的编程问题的标签。\n", - "\n", - "我们已经准备好了一个[数据集](https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz)供您使用,其中包含了几千个发布在 Stack Overflow 
上的编程问题(例如,\"How can sort a dictionary by value in Python?\")。每一个问题都只有一个标签(Python、CSharp、JavaScript 或 Java)。您的任务是将问题作为输入,并预测适当的标签,在本例中为 Python。\n", - "\n", - "您将使用的数据集包含从 [BigQuery](https://console.cloud.google.com/marketplace/details/stack-exchange/stack-overflow) 上更大的公共 Stack Overflow 数据集提取的数千个问题,其中包含超过 1700 万个帖子。\n", - "\n", - "下载数据集后,您会发现它与您之前使用的 IMDB 数据集具有相似的目录结构:\n", - "\n", - "```\n", - "train/\n", - "...python/\n", - "......0.txt\n", - "......1.txt\n", - "...javascript/\n", - "......0.txt\n", - "......1.txt\n", - "...csharp/\n", - "......0.txt\n", - "......1.txt\n", - "...java/\n", - "......0.txt\n", - "......1.txt\n", - "```\n", - "\n", - "注:为了增加分类问题的难度,编程问题中出现的 Python、CSharp、JavaScript 或 Java 等词已被替换为 *blank*(因为许多问题都包含它们所涉及的语言)。\n", - "\n", - "要完成此练习,您应该对此笔记本进行以下修改以使用 Stack Overflow 数据集:\n", - "\n", - "1. 在笔记本顶部,将下载 IMDB 数据集的代码更新为下载前面准备好的 [Stack Overflow 数据集](https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz)的代码。由于 Stack Overflow 数据集具有类似的目录结构,因此您不需要进行太多修改。\n", - "\n", - "2. 将模型的最后一层修改为 `Dense(4)`,因为现在有四个输出类。\n", - "\n", - "3. 编译模型时,将损失更改为 `tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)`。当每个类的标签是整数(在本例中,它们可以是 0、*1*、*2* 或 *3*)时,这是用于多类分类问题的正确损失函数。 此外,将指标更改为 `metrics=['accuracy']`,因为这是一个多类分类问题(`tf.metrics.BinaryAccuracy` 仅用于二元分类器 )。\n", - "\n", - "4. 在绘制随时间变化的准确率时,请将 `binary_accuracy` 和 `val_binary_accuracy` 分别更改为 `accuracy` 和 `val_accuracy`。\n", - "\n", - "5. 
完成这些更改后,就可以训练多类分类器了。 " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "F0T5SIwSm7uc" - }, - "source": [ - "## 了解更多信息\n", - "\n", - "本教程从头开始介绍了文本分类。要详细了解一般的文本分类工作流程,请查看 Google Developers 提供的[文本分类指南](https://developers.google.com/machine-learning/guides/text-classification/)。\n" - ] - } - ], - "metadata": { - "accelerator": "GPU", - "colab": { - "collapsed_sections": [], - "name": "text_classification.ipynb", - "toc_visible": true - }, - "kernelspec": { - "display_name": "Python 3", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 0 + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "Ic4_occAAiAT" + }, + "source": [ + "##### Copyright 2019 The TensorFlow Authors." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "ioaprt5q5US7" + }, + "outputs": [], + "source": [ + "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "yCl0eTNH5RS3" + }, + "outputs": [], + "source": [ + "#@title MIT License\n", + "#\n", + "# Copyright (c) 2017 François Chollet\n", + "#\n", + "# Permission is hereby granted, free of charge, to any person obtaining a\n", + "# copy of this software and associated documentation files (the \"Software\"),\n", + "# to deal in the Software without restriction, including without limitation\n", + "# the rights to use, copy, modify, merge, publish, distribute, sublicense,\n", + "# and/or sell copies of the Software, and to permit persons to whom the\n", + "# Software is furnished to do so, subject to the following conditions:\n", + "#\n", + "# The above copyright notice and this permission notice shall be included in\n", + "# all copies or substantial portions of the Software.\n", + "#\n", + "# THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n", + "# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n", + "# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL\n", + "# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n", + "# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING\n", + "# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER\n", + "# DEALINGS IN THE SOFTWARE." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ItXfxkxvosLH" + }, + "source": [ + "# 电影评论文本分类" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hKY4XMc9o8iB" + }, + "source": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
在 TensorFlow.org 上查看 在 Google Colab 中运行 在 GitHub 上查看源代码 下载笔记本
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Eg62Pmz3o83v" + }, + "source": [ + "本教程演示了从存储在磁盘上的纯文本文件开始的文本分类。您将训练一个二元分类器对 IMDB 数据集执行情感分析。在笔记本的最后,有一个练习供您尝试,您将在其中训练一个多类分类器来预测 Stack Overflow 上编程问题的标签。\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "8RZOuS9LWQvv" + }, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "import os\n", + "import re\n", + "import shutil\n", + "import string\n", + "import tensorflow as tf\n", + "\n", + "from tensorflow.keras import layers\n", + "from tensorflow.keras import losses\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "6-tTFS04dChr" + }, + "outputs": [], + "source": [ + "print(tf.__version__)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NBTI1bi8qdFV" + }, + "source": [ + "## 情感分析\n", + "\n", + "此笔记本训练了一个情感分析模型,利用评论文本将电影评论分类为*正面*或*负面*评价。这是一个*二元*(或二类)分类示例,也是一个重要且应用广泛的机器学习问题。\n", + "\n", + "您将使用 [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/),其中包含 [Internet Movie Database](https://www.imdb.com/) 中的 50,000 条电影评论文本 。我们将这些评论分为两组,其中 25,000 条用于训练,另外 25,000 条用于测试。训练集和测试集是*均衡的*,也就是说其中包含相等数量的正面评价和负面评价。\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iAsKG535pHep" + }, + "source": [ + "### 下载并探索 IMDB 数据集\n", + "\n", + "我们下载并提取数据集,然后浏览一下目录结构。Windows 可能会碰到目录问题,可以使用 dataset_dir = os.path.join(dataset, 'aclImdb')。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "k7ZYnuajVlFN" + }, + "outputs": [], + "source": [ + "url = \"https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\"\n", + "file_hash = \"c40f74a18d3b61f90feba1e17730e0d38e8b97c05fde7008942e91923d1658fe\"\n", + "\n", + "dataset = tf.keras.utils.get_file(fname=\"aclImdb_v1\", origin=url,\n", + " extract=True, cache_dir='.',\n", + " cache_subdir='', file_hash=file_hash, hash_algorithm='sha256')\n", + "\n", + "dataset_dir = 
os.path.join(os.path.dirname(dataset), 'aclImdb')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "355CfOvsV1pl" + }, + "outputs": [], + "source": [ + "os.listdir(dataset_dir)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "7ASND15oXpF1" + }, + "outputs": [], + "source": [ + "train_dir = os.path.join(dataset_dir, 'train')\n", + "os.listdir(train_dir)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ysMNMI1CWDFD" + }, + "source": [ + "`aclImdb/train/pos` 和 `aclImdb/train/neg` 目录包含许多文本文件,每个文件都是一条电影评论。我们来看看其中的一条评论。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "R7g8hFvzWLIZ" + }, + "outputs": [], + "source": [ + "sample_file = os.path.join(train_dir, 'pos/1181_9.txt')\n", + "with open(sample_file) as f:\n", + " print(f.read())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Mk20TEm6ZRFP" + }, + "source": [ + "### 加载数据集\n", + "\n", + "接下来,您将从磁盘加载数据并将其准备为适合训练的格式。为此,您将使用有用的 [text_dataset_from_directory](https://tensorflow.google.cn/api_docs/python/tf/keras/preprocessing/text_dataset_from_directory) 实用工具,它期望的目录结构如下所示。\n", + "\n", + "```\n", + "main_directory/\n", + "...class_a/\n", + "......a_text_1.txt\n", + "......a_text_2.txt\n", + "...class_b/\n", + "......b_text_1.txt\n", + "......b_text_2.txt\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nQauv38Lnok3" + }, + "source": [ + "要准备用于二元分类的数据集,磁盘上需要有两个文件夹,分别对应于 `class_a` 和 `class_b`。这些将是正面和负面的电影评论,可以在 `aclImdb/train/pos` 和 `aclImdb/train/neg` 中找到。由于 IMDB 数据集包含其他文件夹,因此您需要在使用此实用工具之前将其移除。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VhejsClzaWfl" + }, + "outputs": [], + "source": [ + "remove_dir = os.path.join(train_dir, 'unsup')\n", + "shutil.rmtree(remove_dir)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "95kkUdRoaeMw" + }, + "source": [ + "接下来,您将使用 
`text_dataset_from_directory` 实用工具创建带标签的 `tf.data.Dataset`。[tf.data](https://tensorflow.google.cn/guide/data) 是一组强大的数据处理工具。\n", + "\n", + "运行机器学习实验时,最佳做法是将数据集拆成三份:[训练](https://developers.google.com/machine-learning/glossary#training_set)、[验证](https://developers.google.com/machine-learning/glossary#validation_set) 和 [测试](https://developers.google.com/machine-learning/glossary#test-set)。\n", + "\n", + "IMDB 数据集已经分成训练集和测试集,但缺少验证集。我们来通过下面的 `validation_split` 参数,使用 80:20 拆分训练数据来创建验证集。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "nOrK-MTYaw3C" + }, + "outputs": [], + "source": [ + "batch_size = 32\n", + "seed = 42\n", + "\n", + "raw_train_ds = tf.keras.utils.text_dataset_from_directory(\n", + "    'aclImdb/train', \n", + "    batch_size=batch_size, \n", + "    validation_split=0.2, \n", + "    subset='training', \n", + "    seed=seed)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5Y33oxOUpYkh" + }, + "source": [ + "如上所示,训练文件夹中有 25,000 个样本,您将使用其中的 80%(或 20,000 个)进行训练。稍后您将看到,您可以通过将数据集直接传递给 `model.fit` 来训练模型。如果您不熟悉 `tf.data`,还可以遍历数据集并打印出一些样本,如下所示。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "51wNaPPApk1K" + }, + "outputs": [], + "source": [ + "for text_batch, label_batch in raw_train_ds.take(1):\n", + "  for i in range(3):\n", + "    print(\"Review\", text_batch.numpy()[i])\n", + "    print(\"Label\", label_batch.numpy()[i])" + ] + },
+ { + "cell_type": "markdown", + "metadata": { + "id": "JWq1SUIrp1a-" + }, + "source": [ + "请注意,评论包含原始文本(带有标点符号和偶尔出现的 HTML 代码,如 `<br />`)。我们将在以下部分展示如何处理这些问题。\n", + "\n", + "标签为 0 或 1。要查看它们与正面和负面电影评论的对应关系,可以查看数据集上的 `class_names` 属性。\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MlICTG8spyO2" + }, + "outputs": [], + "source": [ + "print(\"Label 0 corresponds to\", raw_train_ds.class_names[0])\n", + "print(\"Label 1 corresponds to\", raw_train_ds.class_names[1])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pbdO39vYqdJr" + }, + "source": [ + "接下来,您将创建验证数据集和测试数据集。您将使用训练集中剩余的 5,000 条评论进行验证。" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SzxazN8Hq1pF" + }, + "source": [ + "注:使用 `validation_split` 和 `subset` 参数时,请确保要么指定随机种子,要么传递 `shuffle=False`,这样验证拆分和训练拆分就不会重叠。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "JsMwwhOoqjKF" + }, + "outputs": [], + "source": [ + "raw_val_ds = tf.keras.utils.text_dataset_from_directory(\n", + "    'aclImdb/train', \n", + "    batch_size=batch_size, \n", + "    validation_split=0.2, \n", + "    subset='validation', \n", + "    seed=seed)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "rdSr0Nt3q_ns" + }, + "outputs": [], + "source": [ + "raw_test_ds = tf.keras.utils.text_dataset_from_directory(\n", + "    'aclImdb/test', \n", + "    batch_size=batch_size)" + ] + },
+ { + "cell_type": "markdown", + "metadata": { + "id": "qJmTiO0IYAjm" + }, + "source": [ + "### 准备用于训练的数据集\n", + "\n", + "接下来,您将使用有用的 `tf.keras.layers.TextVectorization` 层对数据进行标准化、词例化和向量化。\n", + "\n", + "标准化是指对文本进行预处理,通常是移除标点符号或 HTML 元素以简化数据集。词例化是指将字符串分割成词例(例如,通过空格将句子分割成单个单词)。向量化是指将词例转换为数字,以便将它们输入神经网络。所有这些任务都可以通过这个层完成。\n", + "\n", + "正如您在上面看到的,评论包含各种 HTML 代码,例如 `<br />`。`TextVectorization` 层(默认情况下会将文本转换为小写并去除标点符号,但不会去除 HTML)中的默认标准化程序不会移除这些代码。您将编写一个自定义标准化函数来移除 HTML。" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZVcHl-SLrH-u" + }, + "source": [ + "注:为了防止[训练-测试偏差](https://developers.google.com/machine-learning/guides/rules-of-ml#training-serving_skew)(也称为训练-应用偏差),在训练和测试时间对数据进行相同的预处理非常重要。为此,可以将 `TextVectorization` 层直接包含在模型中,如本教程后面所示。" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "SDRI_s_tX1Hk" + }, + "outputs": [], + "source": [ + "def custom_standardization(input_data):\n", + "  lowercase = tf.strings.lower(input_data)\n", + "  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')\n", + "  return tf.strings.regex_replace(stripped_html,\n", + "                                  '[%s]' % re.escape(string.punctuation),\n", + "                                  '')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d2d3Aw8dsUux" + }, + "source": [ + "
接下来,您将创建一个 `TextVectorization` 层。您将使用该层对我们的数据进行标准化、词例化和向量化。您将 `output_mode` 设置为 `int` 以便为每个词例创建唯一的整数索引。\n", + "\n", + "请注意,您使用的是默认拆分函数,以及您在上面定义的自定义标准化函数。您还将为模型定义一些常量,例如显式的最大 `sequence_length`,这会使层将序列填充或截断为精确的 `sequence_length` 值。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "-c76RvSzsMnX" + }, + "outputs": [], + "source": [ + "max_features = 10000\n", + "sequence_length = 250\n", + "\n", + "vectorize_layer = layers.TextVectorization(\n", + " standardize=custom_standardization,\n", + " max_tokens=max_features,\n", + " output_mode='int',\n", + " output_sequence_length=sequence_length)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vlFOpfF6scT6" + }, + "source": [ + "接下来,您将调用 `adapt` 以使预处理层的状态适合数据集。这会使模型构建字符串到整数的索引。" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lAhdjK7AtroA" + }, + "source": [ + "注:在调用时请务必仅使用您的训练数据(使用测试集会泄漏信息)。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "GH4_2ZGJsa_X" + }, + "outputs": [], + "source": [ + "# Make a text-only dataset (without labels), then call adapt\n", + "train_text = raw_train_ds.map(lambda x, y: x)\n", + "vectorize_layer.adapt(train_text)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SHQVEFzNt-K_" + }, + "source": [ + "我们来创建一个函数来查看使用该层预处理一些数据的结果。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "SCIg_T50wOCU" + }, + "outputs": [], + "source": [ + "def vectorize_text(text, label):\n", + " text = tf.expand_dims(text, -1)\n", + " return vectorize_layer(text), label" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "XULcm6B3xQIO" + }, + "outputs": [], + "source": [ + "# retrieve a batch (of 32 reviews and labels) from the dataset\n", + "text_batch, label_batch = next(iter(raw_train_ds))\n", + "first_review, first_label = text_batch[0], label_batch[0]\n", + "print(\"Review\", first_review)\n", + "print(\"Label\", 
raw_train_ds.class_names[first_label])\n", + "print(\"Vectorized review\", vectorize_text(first_review, first_label))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6u5EX0hxyNZT" + }, + "source": [ + "正如您在上面看到的,每个词例都被一个整数替换了。您可以通过在该层上调用 `.get_vocabulary()` 来查找每个整数对应的词例(字符串)。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "kRq9hTQzhVhW" + }, + "outputs": [], + "source": [ + "print(\"1287 ---> \",vectorize_layer.get_vocabulary()[1287])\n", + "print(\" 313 ---> \",vectorize_layer.get_vocabulary()[313])\n", + "print('Vocabulary size: {}'.format(len(vectorize_layer.get_vocabulary())))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XD2H6utRydGv" + }, + "source": [ + "您现在几乎可以开始训练模型了。作为最后的预处理步骤,您将把之前创建的 `TextVectorization` 层应用于训练、验证和测试数据集。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2zhmpeViI1iG" + }, + "outputs": [], + "source": [ + "train_ds = raw_train_ds.map(vectorize_text)\n", + "val_ds = raw_val_ds.map(vectorize_text)\n", + "test_ds = raw_test_ds.map(vectorize_text)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YsVQyPMizjuO" + }, + "source": [ + "### 配置数据集以提高性能\n", + "\n", + "加载数据时应使用以下两种重要方法,以确保 I/O 不会造成阻塞。\n", + "\n", + "从磁盘加载后,`.cache()` 会将数据保存在内存中。这将确保数据集在训练模型时不会成为瓶颈。如果您的数据集太大而无法放入内存,也可以使用此方法创建高性能的磁盘缓存,其读取效率高于读取许多小文件。\n", + "\n", + "`.prefetch()` 会在训练时使数据预处理与模型执行重叠进行。\n", + "\n", + "您可以在[数据性能指南](https://tensorflow.google.cn/guide/data_performance)中深入了解这两种方法,以及如何将数据缓存到磁盘。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "wMcs_H7izm5m" + }, + "outputs": [], + "source": [ + "AUTOTUNE = tf.data.AUTOTUNE\n", + "\n", + "train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)\n", + "val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)\n", + "test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LLC02j2g-llC" + }, + 
"source": [ + "### 创建模型\n", + "\n", + "是时候创建您的神经网络了:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "dkQP6in8yUBR" + }, + "outputs": [], + "source": [ + "embedding_dim = 16" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "xpKOoWgu-llD" + }, + "outputs": [], + "source": [ + "model = tf.keras.Sequential([\n", + " layers.Embedding(max_features + 1, embedding_dim),\n", + " layers.Dropout(0.2),\n", + " layers.GlobalAveragePooling1D(),\n", + " layers.Dropout(0.2),\n", + " layers.Dense(1)])\n", + "\n", + "model.summary()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6PbKQ6mucuKL" + }, + "source": [ + "层按顺序堆叠以构建分类器:\n", + "\n", + "1. 第一个层是 `Embedding` 层。此层采用整数编码的评论,并查找每个单词索引的嵌入向量。这些向量是通过模型训练学习到的。向量向输出数组增加了一个维度。得到的维度为:`(batch, sequence, embedding)`。要详细了解嵌入向量,请参阅[单词嵌入向量](https://tensorflow.google.cn/text/guide/word_embeddings)教程。\n", + "2. 接下来,`GlobalAveragePooling1D` 将通过对序列维度求平均值来为每个样本返回一个定长输出向量。这允许模型以尽可能最简单的方式处理变长输入。\n", + "3. 
最后一层是只有单个输出节点的密集连接层。" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "L4EqVWg4-llM" + }, + "source": [ + "### 损失函数与优化器\n", + "\n", + "模型训练需要一个损失函数和一个优化器。由于这是一个二元分类问题,并且模型的最后一层是不带激活函数的单一单元层(输出 logit 而不是概率),我们将使用 `losses.BinaryCrossentropy` 损失函数,并设置 `from_logits=True`。\n", + "\n", + "现在,配置模型以使用优化器和损失函数:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Mr0GP-cQ-llN" + }, + "outputs": [], + "source": [ + "model.compile(loss=losses.BinaryCrossentropy(from_logits=True),\n", + "              optimizer='adam',\n", + "              metrics=[tf.metrics.BinaryAccuracy(threshold=0.0)])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "35jv_fzP-llU" + }, + "source": [ + "### 训练模型\n", + "\n", + "将 `dataset` 对象传递给 `fit` 方法,对模型进行训练。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "tXSGrjWZ-llW" + }, + "outputs": [], + "source": [ + "epochs = 10\n", + "history = model.fit(\n", + "    train_ds,\n", + "    validation_data=val_ds,\n", + "    epochs=epochs)" + ] + },
+ { + "cell_type": "markdown", + "metadata": { + "id": "9EEGuDVuzb5r" + }, + "source": [ + "### 评估模型\n", + "\n", + "我们来看一下模型的性能如何。`model.evaluate` 将返回两个值:损失值(loss,一个表示误差的数字,值越低越好)与准确率(accuracy)。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "zOMKywn4zReN" + }, + "outputs": [], + "source": [ + "loss, accuracy = model.evaluate(test_ds)\n", + "\n", + "print(\"Loss: \", loss)\n", + "print(\"Accuracy: \", accuracy)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "z1iEXVTR0Z2t" + }, + "source": [ + "这种十分简单的方式实现了约 86% 的准确率。" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ldbQqCw2Xc1W" + }, + "source": [ + "### 创建准确率和损失随时间变化的图表\n", + "\n", + "`model.fit()` 会返回包含一个字典的 `History` 对象。该字典包含训练过程中产生的所有信息:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "-YcvZsdvWfDf" + }, + "outputs": [], + "source": [ + "history_dict = history.history\n", + "history_dict.keys()" + ] + },
+ { + "cell_type": "markdown", + "metadata": { + "id": "1_CH32qJXruI" + }, + "source": [ + "其中有四个条目:每个条目代表训练和验证过程中的一项监测指标。您可以使用这些指标来绘制用于比较的训练损失和验证损失图表,以及训练准确率和验证准确率图表:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2SEMeQ5YXs8z" + }, + "outputs": [], + "source": [ + "acc = history_dict['binary_accuracy']\n", + "val_acc = history_dict['val_binary_accuracy']\n", + "loss = history_dict['loss']\n", + "val_loss = history_dict['val_loss']\n", + "\n", + "epochs = range(1, len(acc) + 1)\n", + "\n", + "# \"bo\" is for \"blue dot\"\n", + "plt.plot(epochs, loss, 'bo', label='Training loss')\n", + "# b is for \"solid blue line\"\n", + "plt.plot(epochs, val_loss, 'b', label='Validation loss')\n", + "plt.title('Training and validation loss')\n", + "plt.xlabel('Epochs')\n", + "plt.ylabel('Loss')\n", + "plt.legend()\n", + "\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Z3PJemLPXwz_" + }, + "outputs": [], + "source": [ + "plt.plot(epochs, acc, 'bo', label='Training acc')\n", + "plt.plot(epochs, val_acc, 'b', label='Validation acc')\n", + "plt.title('Training and validation accuracy')\n", + "plt.xlabel('Epochs')\n", + "plt.ylabel('Accuracy')\n", + "plt.legend(loc='lower right')\n", + "\n", + "plt.show()" + ] + },
+ { + "cell_type": "markdown", + "metadata": { + "id": "hFFyCuJoXy7r" + }, + "source": [ + "在该图表中,圆点代表训练损失和准确率,实线代表验证损失和准确率。\n", + "\n", + "请注意,训练损失会逐周期*下降*,而训练准确率则逐周期*上升*。使用梯度下降优化时,这是预期结果,因为梯度下降会在每次迭代中尽量减小目标量(即损失)。\n", + "\n", + "但是,验证损失和验证准确率并非如此:它们似乎先于训练准确率达到峰值。这是过拟合的一个例子:模型在训练数据上的表现要好于在之前从未见过的数据上的表现。经过这一点之后,模型会过度优化,学习*特定*于训练数据的表示,而这些表示无法*泛化*到测试数据。\n", + "\n", + "对于这种特殊情况,您可以通过在验证准确率不再增加时直接停止训练来防止过拟合。一种方式是使用 `tf.keras.callbacks.EarlyStopping` 回调。" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-to23J3Vy5d3" + }, + "source": [ + "## 导出模型\n", + "\n", + "在上面的代码中,您在向模型馈送文本之前对数据集应用了 `TextVectorization`。如果您想让模型能够处理原始字符串(例如,为了简化部署),您可以在模型中包含 `TextVectorization` 
层。为此,您可以使用刚刚训练的权重创建一个新模型。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "FWXsMvryuZuq" + }, + "outputs": [], + "source": [ + "export_model = tf.keras.Sequential([\n", + " vectorize_layer,\n", + " model,\n", + " layers.Activation('sigmoid')\n", + "])\n", + "\n", + "export_model.compile(\n", + " loss=losses.BinaryCrossentropy(from_logits=False), optimizer=\"adam\", metrics=['accuracy']\n", + ")\n", + "\n", + "# Test it with `raw_test_ds`, which yields raw strings\n", + "loss, accuracy = export_model.evaluate(raw_test_ds)\n", + "print(accuracy)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TwQgoN88LoEF" + }, + "source": [ + "### 使用新数据进行推断\n", + "\n", + "要获得对新样本的预测,只需调用 `model.predict()` 即可。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "QW355HH5L49K" + }, + "outputs": [], + "source": [ + "examples = [\n", + " \"The movie was great!\",\n", + " \"The movie was okay.\",\n", + " \"The movie was terrible...\"\n", + "]\n", + "\n", + "export_model.predict(examples)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MaxlpFWpzR6c" + }, + "source": [ + "将文本预处理逻辑包含在模型中后,您可以导出用于生产的模型,从而简化部署并降低[训练/测试偏差](https://developers.google.com/machine-learning/guides/rules-of-ml#training-serving_skew)的可能性。\n", + "\n", + "在选择应用 TextVectorization 层的位置时,需要注意性能差异。在模型之外使用它可以让您在 GPU 上训练时进行异步 CPU 处理和数据缓冲。因此,如果您在 GPU 上训练模型,您应该在开发模型时使用此选项以获得最佳性能,然后在准备好部署时进行切换,在模型中包含 TextVectorization 层。\n", + "\n", + "请参阅此[教程](https://tensorflow.google.cn/tutorials/keras/save_and_load),详细了解如何保存模型。" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eSSuci_6nCEG" + }, + "source": [ + "## 练习:对 Stack Overflow 问题进行多类分类\n", + "\n", + "本教程展示了如何在 IMDB 数据集上从头开始训练二元分类器。作为练习,您可以修改此笔记本以训练多类分类器来预测 [Stack Overflow](http://stackoverflow.com/) 上的编程问题的标签。\n", + "\n", + "我们已经准备好了一个[数据集](https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz)供您使用,其中包含了几千个发布在 Stack Overflow 
上的编程问题(例如,\"How can I sort a dictionary by value in Python?\")。每一个问题都只有一个标签(Python、CSharp、JavaScript 或 Java)。您的任务是将问题作为输入,并预测适当的标签,在本例中为 Python。\n", + "\n", + "您将使用的数据集包含数千个问题,它们提取自 [BigQuery](https://console.cloud.google.com/marketplace/details/stack-exchange/stack-overflow) 上更大的公共 Stack Overflow 数据集,后者包含超过 1700 万个帖子。\n", + "\n", + "下载数据集后,您会发现它与您之前使用的 IMDB 数据集具有相似的目录结构:\n", + "\n", + "```\n", + "train/\n", + "...python/\n", + "......0.txt\n", + "......1.txt\n", + "...javascript/\n", + "......0.txt\n", + "......1.txt\n", + "...csharp/\n", + "......0.txt\n", + "......1.txt\n", + "...java/\n", + "......0.txt\n", + "......1.txt\n", + "```\n", + "\n", + "注:为了增加分类问题的难度,编程问题中出现的 Python、CSharp、JavaScript 或 Java 等词已被替换为 *blank*(因为许多问题都包含它们所涉及的语言)。\n", + "\n", + "要完成此练习,您应该对此笔记本进行以下修改以使用 Stack Overflow 数据集:\n", + "\n", + "1. 在笔记本顶部,将下载 IMDB 数据集的代码更新为下载前面准备好的 [Stack Overflow 数据集](https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz)的代码。由于 Stack Overflow 数据集具有类似的目录结构,因此您不需要进行太多修改。\n", + "\n", + "2. 将模型的最后一层修改为 `Dense(4)`,因为现在有四个输出类。\n", + "\n", + "3. 编译模型时,将损失更改为 `tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)`。当每个类的标签是整数(在本例中,它们可以是 *0*、*1*、*2* 或 *3*)时,这是用于多类分类问题的正确损失函数。此外,将指标更改为 `metrics=['accuracy']`,因为这是一个多类分类问题(`tf.metrics.BinaryAccuracy` 仅用于二元分类器)。\n", + "\n", + "4. 在绘制随时间变化的准确率时,请将 `binary_accuracy` 和 `val_binary_accuracy` 分别更改为 `accuracy` 和 `val_accuracy`。\n", + "\n", + "5. 
完成这些更改后,就可以训练多类分类器了。 " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F0T5SIwSm7uc" + }, + "source": [ + "## 了解更多信息\n", + "\n", + "本教程从头开始介绍了文本分类。要详细了解一般的文本分类工作流程,请查看 Google Developers 提供的[文本分类指南](https://developers.google.com/machine-learning/guides/text-classification/)。\n" + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "collapsed_sections": [], + "name": "text_classification.ipynb", + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 0 } From f443e594387afdb0361431fd90cdb5c52f3e7d1c Mon Sep 17 00:00:00 2001 From: Lekton ZHANG <869944+takato3000@users.noreply.github.com> Date: Wed, 4 Dec 2024 06:24:24 +0800 Subject: [PATCH 2/2] Also Update English Version of tutorial. --- .../tutorials/keras/text_classification.ipynb | 1961 +++++++++-------- 1 file changed, 981 insertions(+), 980 deletions(-) diff --git a/site/en-snapshot/tutorials/keras/text_classification.ipynb b/site/en-snapshot/tutorials/keras/text_classification.ipynb index 4182c3f295..94390105e2 100644 --- a/site/en-snapshot/tutorials/keras/text_classification.ipynb +++ b/site/en-snapshot/tutorials/keras/text_classification.ipynb @@ -1,982 +1,983 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "Ic4_occAAiAT" - }, - "source": [ - "##### Copyright 2019 The TensorFlow Authors." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "cellView": "form", - "id": "ioaprt5q5US7" - }, - "outputs": [], - "source": [ - "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", - "# you may not use this file except in compliance with the License.\n", - "# You may obtain a copy of the License at\n", - "#\n", - "# https://www.apache.org/licenses/LICENSE-2.0\n", - "#\n", - "# Unless required by applicable law or agreed to in writing, software\n", - "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", - "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", - "# See the License for the specific language governing permissions and\n", - "# limitations under the License." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "cellView": "form", - "id": "yCl0eTNH5RS3" - }, - "outputs": [], - "source": [ - "#@title MIT License\n", - "#\n", - "# Copyright (c) 2017 François Chollet\n", - "#\n", - "# Permission is hereby granted, free of charge, to any person obtaining a\n", - "# copy of this software and associated documentation files (the \"Software\"),\n", - "# to deal in the Software without restriction, including without limitation\n", - "# the rights to use, copy, modify, merge, publish, distribute, sublicense,\n", - "# and/or sell copies of the Software, and to permit persons to whom the\n", - "# Software is furnished to do so, subject to the following conditions:\n", - "#\n", - "# The above copyright notice and this permission notice shall be included in\n", - "# all copies or substantial portions of the Software.\n", - "#\n", - "# THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n", - "# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n", - "# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL\n", - "# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n", - "# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING\n", - "# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER\n", - "# DEALINGS IN THE SOFTWARE." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ItXfxkxvosLH" - }, - "source": [ - "# Basic text classification" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "hKY4XMc9o8iB" - }, - "source": [ - "\n", - " \n", - " \n", - " \n", - " \n", - "
\n", - " View on TensorFlow.org\n", - " \n", - " Run in Google Colab\n", - " \n", - " View source on GitHub\n", - " \n", - " Download notebook\n", - "
" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Eg62Pmz3o83v" - }, - "source": [ - "This tutorial demonstrates text classification starting from plain text files stored on disk. You'll train a binary classifier to perform sentiment analysis on an IMDB dataset. At the end of the notebook, there is an exercise for you to try, in which you'll train a multi-class classifier to predict the tag for a programming question on Stack Overflow.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "8RZOuS9LWQvv" - }, - "outputs": [], - "source": [ - "import matplotlib.pyplot as plt\n", - "import os\n", - "import re\n", - "import shutil\n", - "import string\n", - "import tensorflow as tf\n", - "\n", - "from tensorflow.keras import layers\n", - "from tensorflow.keras import losses\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "6-tTFS04dChr" - }, - "outputs": [], - "source": [ - "print(tf.__version__)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NBTI1bi8qdFV" - }, - "source": [ - "## Sentiment analysis\n", - "\n", - "This notebook trains a sentiment analysis model to classify movie reviews as *positive* or *negative*, based on the text of the review. This is an example of *binary*—or two-class—classification, an important and widely applicable kind of machine learning problem.\n", - "\n", - "You'll use the [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/) that contains the text of 50,000 movie reviews from the [Internet Movie Database](https://www.imdb.com/). These are split into 25,000 reviews for training and 25,000 reviews for testing. 
The training and testing sets are *balanced*, meaning they contain an equal number of positive and negative reviews.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "iAsKG535pHep" - }, - "source": [ - "### Download and explore the IMDB dataset\n", - "\n", - "Let's download and extract the dataset, then explore the directory structure." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "k7ZYnuajVlFN" - }, - "outputs": [], - "source": [ - "url = \"https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\"\n", - "\n", - "dataset = tf.keras.utils.get_file(\"aclImdb_v1\", url,\n", - " untar=True, cache_dir='.',\n", - " cache_subdir='')\n", - "\n", - "dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "355CfOvsV1pl" - }, - "outputs": [], - "source": [ - "os.listdir(dataset_dir)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "7ASND15oXpF1" - }, - "outputs": [], - "source": [ - "train_dir = os.path.join(dataset_dir, 'train')\n", - "os.listdir(train_dir)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ysMNMI1CWDFD" - }, - "source": [ - "The `aclImdb/train/pos` and `aclImdb/train/neg` directories contain many text files, each of which is a single movie review. Let's take a look at one of them." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "R7g8hFvzWLIZ" - }, - "outputs": [], - "source": [ - "sample_file = os.path.join(train_dir, 'pos/1181_9.txt')\n", - "with open(sample_file) as f:\n", - " print(f.read())" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Mk20TEm6ZRFP" - }, - "source": [ - "### Load the dataset\n", - "\n", - "Next, you will load the data off disk and prepare it into a format suitable for training. 
To do so, you will use the helpful [text_dataset_from_directory](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text_dataset_from_directory) utility, which expects a directory structure as follows.\n", - "\n", - "```\n", - "main_directory/\n", - "...class_a/\n", - "......a_text_1.txt\n", - "......a_text_2.txt\n", - "...class_b/\n", - "......b_text_1.txt\n", - "......b_text_2.txt\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "nQauv38Lnok3" - }, - "source": [ - "To prepare a dataset for binary classification, you will need two folders on disk, corresponding to `class_a` and `class_b`. These will be the positive and negative movie reviews, which can be found in `aclImdb/train/pos` and `aclImdb/train/neg`. As the IMDB dataset contains additional folders, you will remove them before using this utility." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "VhejsClzaWfl" - }, - "outputs": [], - "source": [ - "remove_dir = os.path.join(train_dir, 'unsup')\n", - "shutil.rmtree(remove_dir)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "95kkUdRoaeMw" - }, - "source": [ - "Next, you will use the `text_dataset_from_directory` utility to create a labeled `tf.data.Dataset`. [tf.data](https://www.tensorflow.org/guide/data) is a powerful collection of tools for working with data. \n", - "\n", - "When running a machine learning experiment, it is a best practice to divide your dataset into three splits: [train](https://developers.google.com/machine-learning/glossary#training_set), [validation](https://developers.google.com/machine-learning/glossary#validation_set), and [test](https://developers.google.com/machine-learning/glossary#test-set). \n", - "\n", - "The IMDB dataset has already been divided into train and test, but it lacks a validation set. Let's create a validation set using an 80:20 split of the training data by using the `validation_split` argument below." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "nOrK-MTYaw3C" - }, - "outputs": [], - "source": [ - "batch_size = 32\n", - "seed = 42\n", - "\n", - "raw_train_ds = tf.keras.utils.text_dataset_from_directory(\n", - "    'aclImdb/train', \n", - "    batch_size=batch_size, \n", - "    validation_split=0.2, \n", - "    subset='training', \n", - "    seed=seed)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "5Y33oxOUpYkh" - }, - "source": [ - "As you can see above, there are 25,000 examples in the training folder, of which you will use 80% (or 20,000) for training. As you will see in a moment, you can train a model by passing a dataset directly to `model.fit`. If you're new to `tf.data`, you can also iterate over the dataset and print out a few examples as follows." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "51wNaPPApk1K" - }, - "outputs": [], - "source": [ - "for text_batch, label_batch in raw_train_ds.take(1):\n", - "  for i in range(3):\n", - "    print(\"Review\", text_batch.numpy()[i])\n", - "    print(\"Label\", label_batch.numpy()[i])" - ] - },
- { - "cell_type": "markdown", - "metadata": { - "id": "JWq1SUIrp1a-" - }, - "source": [ - "Notice the reviews contain raw text (with punctuation and occasional HTML tags like `<br />`). You will show how to handle these in the following section. \n", - "\n", - "The labels are 0 or 1. To see which of these correspond to positive and negative movie reviews, you can check the `class_names` property on the dataset.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "MlICTG8spyO2" - }, - "outputs": [], - "source": [ - "print(\"Label 0 corresponds to\", raw_train_ds.class_names[0])\n", - "print(\"Label 1 corresponds to\", raw_train_ds.class_names[1])" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pbdO39vYqdJr" - }, - "source": [ - "Next, you will create a validation and test dataset. You will use the remaining 5,000 reviews from the training set for validation." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "SzxazN8Hq1pF" - }, - "source": [ - "Note: When using the `validation_split` and `subset` arguments, make sure to either specify a random seed, or to pass `shuffle=False`, so that the validation and training splits have no overlap." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "JsMwwhOoqjKF" - }, - "outputs": [], - "source": [ - "raw_val_ds = tf.keras.utils.text_dataset_from_directory(\n", - "    'aclImdb/train', \n", - "    batch_size=batch_size, \n", - "    validation_split=0.2, \n", - "    subset='validation', \n", - "    seed=seed)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "rdSr0Nt3q_ns" - }, - "outputs": [], - "source": [ - "raw_test_ds = tf.keras.utils.text_dataset_from_directory(\n", - "    'aclImdb/test', \n", - "    batch_size=batch_size)" - ] - },
- { - "cell_type": "markdown", - "metadata": { - "id": "qJmTiO0IYAjm" - }, - "source": [ - "### Prepare the dataset for training\n", - "\n", - "Next, you will standardize, tokenize, and vectorize the data using the helpful `tf.keras.layers.TextVectorization` layer. \n", - "\n", - "Standardization refers to preprocessing the text, typically to remove punctuation or HTML elements to simplify the dataset. Tokenization refers to splitting strings into tokens (for example, splitting a sentence into individual words, by splitting on whitespace). Vectorization refers to converting tokens into numbers so they can be fed into a neural network. All of these tasks can be accomplished with this layer.\n", - "\n", - "As you saw above, the reviews contain various HTML tags like `<br />`. These tags will not be removed by the default standardizer in the `TextVectorization` layer (which converts text to lowercase and strips punctuation by default, but doesn't strip HTML). You will write a custom standardization function to remove the HTML." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ZVcHl-SLrH-u" - }, - "source": [ - "Note: To prevent [training-testing skew](https://developers.google.com/machine-learning/guides/rules-of-ml#training-serving_skew) (also known as training-serving skew), it is important to preprocess the data identically at train and test time. To facilitate this, the `TextVectorization` layer can be included directly inside your model, as shown later in this tutorial." - ] - },
- { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "SDRI_s_tX1Hk" - }, - "outputs": [], - "source": [ - "def custom_standardization(input_data):\n", - "  lowercase = tf.strings.lower(input_data)\n", - "  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')\n", - "  return tf.strings.regex_replace(stripped_html,\n", - "                                  '[%s]' % re.escape(string.punctuation),\n", - "                                  '')" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "d2d3Aw8dsUux" - }, - "source": [ - "Next, you will create a `TextVectorization` layer. You will use this layer to standardize, tokenize, and vectorize our data. You set the `output_mode` to `int` to create unique integer indices for each token.\n", - "\n", - "Note that you're using the default split function, and the custom standardization function you defined above. You'll also define some constants for the model, like an explicit maximum `sequence_length`, which will cause the layer to pad or truncate sequences to exactly `sequence_length` values." - ] - },
- { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "-c76RvSzsMnX" - }, - "outputs": [], - "source": [ - "max_features = 10000\n", - "sequence_length = 250\n", - "\n", - "vectorize_layer = layers.TextVectorization(\n", - "    standardize=custom_standardization,\n", - "    max_tokens=max_features,\n", - "    output_mode='int',\n", - "    output_sequence_length=sequence_length)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "vlFOpfF6scT6" - }, - "source": [ - "Next, you will call `adapt` to fit the state of the preprocessing layer to the dataset. This will cause the model to build an index of strings to integers." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "lAhdjK7AtroA" - }, - "source": [ - "Note: It's important to only use your training data when calling adapt (using the test set would leak information)."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "GH4_2ZGJsa_X" - }, - "outputs": [], - "source": [ - "# Make a text-only dataset (without labels), then call adapt\n", - "train_text = raw_train_ds.map(lambda x, y: x)\n", - "vectorize_layer.adapt(train_text)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "SHQVEFzNt-K_" - }, - "source": [ - "Let's create a function to see the result of using this layer to preprocess some data." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "SCIg_T50wOCU" - }, - "outputs": [], - "source": [ - "def vectorize_text(text, label):\n", - " text = tf.expand_dims(text, -1)\n", - " return vectorize_layer(text), label" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "XULcm6B3xQIO" - }, - "outputs": [], - "source": [ - "# retrieve a batch (of 32 reviews and labels) from the dataset\n", - "text_batch, label_batch = next(iter(raw_train_ds))\n", - "first_review, first_label = text_batch[0], label_batch[0]\n", - "print(\"Review\", first_review)\n", - "print(\"Label\", raw_train_ds.class_names[first_label])\n", - "print(\"Vectorized review\", vectorize_text(first_review, first_label))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6u5EX0hxyNZT" - }, - "source": [ - "As you can see above, each token has been replaced by an integer. You can lookup the token (string) that each integer corresponds to by calling `.get_vocabulary()` on the layer." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "kRq9hTQzhVhW" - }, - "outputs": [], - "source": [ - "print(\"1287 ---> \",vectorize_layer.get_vocabulary()[1287])\n", - "print(\" 313 ---> \",vectorize_layer.get_vocabulary()[313])\n", - "print('Vocabulary size: {}'.format(len(vectorize_layer.get_vocabulary())))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "XD2H6utRydGv" - }, - "source": [ - "You are nearly ready to train your model. As a final preprocessing step, you will apply the TextVectorization layer you created earlier to the train, validation, and test dataset." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "2zhmpeViI1iG" - }, - "outputs": [], - "source": [ - "train_ds = raw_train_ds.map(vectorize_text)\n", - "val_ds = raw_val_ds.map(vectorize_text)\n", - "test_ds = raw_test_ds.map(vectorize_text)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "YsVQyPMizjuO" - }, - "source": [ - "### Configure the dataset for performance\n", - "\n", - "These are two important methods you should use when loading data to make sure that I/O does not become blocking.\n", - "\n", - "`.cache()` keeps data in memory after it's loaded off disk. This will ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.\n", - "\n", - "`.prefetch()` overlaps data preprocessing and model execution while training. \n", - "\n", - "You can learn more about both methods, as well as how to cache data to disk in the [data performance guide](https://www.tensorflow.org/guide/data_performance)." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "wMcs_H7izm5m" - }, - "outputs": [], - "source": [ - "AUTOTUNE = tf.data.AUTOTUNE\n", - "\n", - "train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)\n", - "val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)\n", - "test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "LLC02j2g-llC" - }, - "source": [ - "### Create the model\n", - "\n", - "It's time to create your neural network:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "dkQP6in8yUBR" - }, - "outputs": [], - "source": [ - "embedding_dim = 16" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "xpKOoWgu-llD" - }, - "outputs": [], - "source": [ - "model = tf.keras.Sequential([\n", - " layers.Embedding(max_features + 1, embedding_dim),\n", - " layers.Dropout(0.2),\n", - " layers.GlobalAveragePooling1D(),\n", - " layers.Dropout(0.2),\n", - " layers.Dense(1)])\n", - "\n", - "model.summary()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6PbKQ6mucuKL" - }, - "source": [ - "The layers are stacked sequentially to build the classifier:\n", - "\n", - "1. The first layer is an `Embedding` layer. This layer takes the integer-encoded reviews and looks up an embedding vector for each word-index. These vectors are learned as the model trains. The vectors add a dimension to the output array. The resulting dimensions are: `(batch, sequence, embedding)`. To learn more about embeddings, check out the [Word embeddings](https://www.tensorflow.org/text/guide/word_embeddings) tutorial.\n", - "2. Next, a `GlobalAveragePooling1D` layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model to handle input of variable length, in the simplest way possible.\n", - "3. 
The last layer is densely connected with a single output node." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "L4EqVWg4-llM" - }, - "source": [ - "### Loss function and optimizer\n", - "\n", - "A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs a probability (a single-unit layer with a sigmoid activation), you'll use `losses.BinaryCrossentropy` loss function.\n", - "\n", - "Now, configure the model to use an optimizer and a loss function:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Mr0GP-cQ-llN" - }, - "outputs": [], - "source": [ - "model.compile(loss=losses.BinaryCrossentropy(from_logits=True),\n", - " optimizer='adam',\n", - " metrics=tf.metrics.BinaryAccuracy(threshold=0.0))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "35jv_fzP-llU" - }, - "source": [ - "### Train the model\n", - "\n", - "You will train the model by passing the `dataset` object to the fit method." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "tXSGrjWZ-llW" - }, - "outputs": [], - "source": [ - "epochs = 10\n", - "history = model.fit(\n", - " train_ds,\n", - " validation_data=val_ds,\n", - " epochs=epochs)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "9EEGuDVuzb5r" - }, - "source": [ - "### Evaluate the model\n", - "\n", - "Let's see how the model performs. Two values will be returned. Loss (a number which represents our error, lower values are better), and accuracy." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "zOMKywn4zReN" - }, - "outputs": [], - "source": [ - "loss, accuracy = model.evaluate(test_ds)\n", - "\n", - "print(\"Loss: \", loss)\n", - "print(\"Accuracy: \", accuracy)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "z1iEXVTR0Z2t" - }, - "source": [ - "This fairly naive approach achieves an accuracy of about 86%." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ldbQqCw2Xc1W" - }, - "source": [ - "### Create a plot of accuracy and loss over time\n", - "\n", - "`model.fit()` returns a `History` object that contains a dictionary with everything that happened during training:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "-YcvZsdvWfDf" - }, - "outputs": [], - "source": [ - "history_dict = history.history\n", - "history_dict.keys()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "1_CH32qJXruI" - }, - "source": [ - "There are four entries: one for each monitored metric during training and validation. 
You can use these to plot the training and validation loss for comparison, as well as the training and validation accuracy:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "2SEMeQ5YXs8z" - }, - "outputs": [], - "source": [ - "acc = history_dict['binary_accuracy']\n", - "val_acc = history_dict['val_binary_accuracy']\n", - "loss = history_dict['loss']\n", - "val_loss = history_dict['val_loss']\n", - "\n", - "epochs = range(1, len(acc) + 1)\n", - "\n", - "# \"bo\" is for \"blue dot\"\n", - "plt.plot(epochs, loss, 'bo', label='Training loss')\n", - "# b is for \"solid blue line\"\n", - "plt.plot(epochs, val_loss, 'b', label='Validation loss')\n", - "plt.title('Training and validation loss')\n", - "plt.xlabel('Epochs')\n", - "plt.ylabel('Loss')\n", - "plt.legend()\n", - "\n", - "plt.show()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Z3PJemLPXwz_" - }, - "outputs": [], - "source": [ - "plt.plot(epochs, acc, 'bo', label='Training acc')\n", - "plt.plot(epochs, val_acc, 'b', label='Validation acc')\n", - "plt.title('Training and validation accuracy')\n", - "plt.xlabel('Epochs')\n", - "plt.ylabel('Accuracy')\n", - "plt.legend(loc='lower right')\n", - "\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "hFFyCuJoXy7r" - }, - "source": [ - "In this plot, the dots represent the training loss and accuracy, and the solid lines are the validation loss and accuracy.\n", - "\n", - "Notice the training loss *decreases* with each epoch and the training accuracy *increases* with each epoch. This is expected when using a gradient descent optimization—it should minimize the desired quantity on every iteration.\n", - "\n", - "This isn't the case for the validation loss and accuracy—they seem to peak before the training accuracy. This is an example of overfitting: the model performs better on the training data than it does on data it has never seen before. 
After this point, the model over-optimizes and learns representations *specific* to the training data that do not *generalize* to test data.\n", - "\n", - "For this particular case, you could prevent overfitting by simply stopping the training when the validation accuracy is no longer increasing. One way to do so is to use the `tf.keras.callbacks.EarlyStopping` callback." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "-to23J3Vy5d3" - }, - "source": [ - "## Export the model\n", - "\n", - "In the code above, you applied the `TextVectorization` layer to the dataset before feeding text to the model. If you want to make your model capable of processing raw strings (for example, to simplify deploying it), you can include the `TextVectorization` layer inside your model. To do so, you can create a new model using the weights you just trained." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "FWXsMvryuZuq" - }, - "outputs": [], - "source": [ - "export_model = tf.keras.Sequential([\n", - " vectorize_layer,\n", - " model,\n", - " layers.Activation('sigmoid')\n", - "])\n", - "\n", - "export_model.compile(\n", - " loss=losses.BinaryCrossentropy(from_logits=False), optimizer=\"adam\", metrics=['accuracy']\n", - ")\n", - "\n", - "# Test it with `raw_test_ds`, which yields raw strings\n", - "loss, accuracy = export_model.evaluate(raw_test_ds)\n", - "print(accuracy)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "TwQgoN88LoEF" - }, - "source": [ - "### Inference on new data\n", - "\n", - "To get predictions for new examples, you can simply call `model.predict()`." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "QW355HH5L49K" - }, - "outputs": [], - "source": [ - "examples = [\n", - " \"The movie was great!\",\n", - " \"The movie was okay.\",\n", - " \"The movie was terrible...\"\n", - "]\n", - "\n", - "export_model.predict(examples)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "MaxlpFWpzR6c" - }, - "source": [ - "Including the text preprocessing logic inside your model enables you to export a model for production that simplifies deployment, and reduces the potential for [train/test skew](https://developers.google.com/machine-learning/guides/rules-of-ml#training-serving_skew).\n", - "\n", - "There is a performance difference to keep in mind when choosing where to apply your TextVectorization layer. Using it outside of your model enables you to do asynchronous CPU processing and buffering of your data when training on GPU. So, if you're training your model on the GPU, you probably want to go with this option to get the best performance while developing your model, then switch to including the TextVectorization layer inside your model when you're ready to prepare for deployment.\n", - "\n", - "Visit this [tutorial](https://www.tensorflow.org/tutorials/keras/save_and_load) to learn more about saving models." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "eSSuci_6nCEG" - }, - "source": [ - "## Exercise: multi-class classification on Stack Overflow questions\n", - "\n", - "This tutorial showed how to train a binary classifier from scratch on the IMDB dataset. 
As an exercise, you can modify this notebook to train a multi-class classifier to predict the tag of a programming question on [Stack Overflow](http://stackoverflow.com/).\n", - "\n", - "A [dataset](https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz) has been prepared for you to use containing the body of several thousand programming questions (for example, \"How can I sort a dictionary by value in Python?\") posted to Stack Overflow. Each of these is labeled with exactly one tag (either Python, CSharp, JavaScript, or Java). Your task is to take a question as input, and predict the appropriate tag, in this case, Python. \n", - "\n", - "The dataset you will work with contains several thousand questions extracted from the much larger public Stack Overflow dataset on [BigQuery](https://console.cloud.google.com/marketplace/details/stack-exchange/stack-overflow), which contains more than 17 million posts.\n", - "\n", - "After downloading the dataset, you will find it has a similar directory structure to the IMDB dataset you worked with previously:\n", - "\n", - "```\n", - "train/\n", - "...python/\n", - "......0.txt\n", - "......1.txt\n", - "...javascript/\n", - "......0.txt\n", - "......1.txt\n", - "...csharp/\n", - "......0.txt\n", - "......1.txt\n", - "...java/\n", - "......0.txt\n", - "......1.txt\n", - "```\n", - "\n", - "Note: To increase the difficulty of the classification problem, occurrences of the words Python, CSharp, JavaScript, or Java in the programming questions have been replaced with the word *blank* (as many questions contain the language they're about).\n", - "\n", - "To complete this exercise, you should modify this notebook to work with the Stack Overflow dataset by making the following modifications:\n", - "\n", - "1. 
At the top of your notebook, update the code that downloads the IMDB dataset with code to download the [Stack Overflow dataset](https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz) that has already been prepared. As the Stack Overflow dataset has a similar directory structure, you will not need to make many modifications.\n", - "\n", - "1. Modify the last layer of your model to `Dense(4)`, as there are now four output classes.\n", - "\n", - "1. When compiling the model, change the loss to `tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)`. This is the correct loss function to use for a multi-class classification problem, when the labels for each class are integers (in this case, they can be 0, *1*, *2*, or *3*). In addition, change the metrics to `metrics=['accuracy']`, since this is a multi-class classification problem (`tf.metrics.BinaryAccuracy` is only used for binary classifiers).\n", - "\n", - "1. When plotting accuracy over time, change `binary_accuracy` and `val_binary_accuracy` to `accuracy` and `val_accuracy`, respectively.\n", - "\n", - "1. Once these changes are complete, you will be able to train a multi-class classifier. " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "F0T5SIwSm7uc" - }, - "source": [ - "## Learning more\n", - "\n", - "This tutorial introduced text classification from scratch. 
To learn more about the text classification workflow in general, check out the [Text classification guide](https://developers.google.com/machine-learning/guides/text-classification/) from Google Developers.\n" - ] - } - ], - "metadata": { - "accelerator": "GPU", - "colab": { - "collapsed_sections": [], - "name": "text_classification.ipynb", - "toc_visible": true - }, - "kernelspec": { - "display_name": "Python 3", - "name": "python3" - } - }, - "nbformat": 4, - "nbformat_minor": 0 + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "Ic4_occAAiAT" + }, + "source": [ + "##### Copyright 2019 The TensorFlow Authors." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "ioaprt5q5US7" + }, + "outputs": [], + "source": [ + "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "yCl0eTNH5RS3" + }, + "outputs": [], + "source": [ + "#@title MIT License\n", + "#\n", + "# Copyright (c) 2017 François Chollet\n", + "#\n", + "# Permission is hereby granted, free of charge, to any person obtaining a\n", + "# copy of this software and associated documentation files (the \"Software\"),\n", + "# to deal in the Software without restriction, including without limitation\n", + "# the rights to use, copy, modify, merge, publish, distribute, sublicense,\n", + "# and/or sell copies of the Software, and to permit persons to whom the\n", + "# Software is furnished to do so, subject to the following conditions:\n", + "#\n", + "# The above copyright notice and this permission notice shall be included in\n", + "# all copies or substantial portions of the Software.\n", + "#\n", + "# THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n", + "# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n", + "# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL\n", + "# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n", + "# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING\n", + "# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER\n", + "# DEALINGS IN THE SOFTWARE." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ItXfxkxvosLH" + }, + "source": [ + "# Basic text classification" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hKY4XMc9o8iB" + }, + "source": [ + "\n", + " \n", + " \n", + " \n", + " \n", + "
\n", + " View on TensorFlow.org\n", + " \n", + " Run in Google Colab\n", + " \n", + " View source on GitHub\n", + " \n", + " Download notebook\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Eg62Pmz3o83v" + }, + "source": [ + "This tutorial demonstrates text classification starting from plain text files stored on disk. You'll train a binary classifier to perform sentiment analysis on an IMDB dataset. At the end of the notebook, there is an exercise for you to try, in which you'll train a multi-class classifier to predict the tag for a programming question on Stack Overflow.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "8RZOuS9LWQvv" + }, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "import os\n", + "import re\n", + "import shutil\n", + "import string\n", + "import tensorflow as tf\n", + "\n", + "from tensorflow.keras import layers\n", + "from tensorflow.keras import losses\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "6-tTFS04dChr" + }, + "outputs": [], + "source": [ + "print(tf.__version__)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NBTI1bi8qdFV" + }, + "source": [ + "## Sentiment analysis\n", + "\n", + "This notebook trains a sentiment analysis model to classify movie reviews as *positive* or *negative*, based on the text of the review. This is an example of *binary*—or two-class—classification, an important and widely applicable kind of machine learning problem.\n", + "\n", + "You'll use the [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/) that contains the text of 50,000 movie reviews from the [Internet Movie Database](https://www.imdb.com/). These are split into 25,000 reviews for training and 25,000 reviews for testing. 
The training and testing sets are *balanced*, meaning they contain an equal number of positive and negative reviews.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iAsKG535pHep" + }, + "source": [ + "### Download and explore the IMDB dataset\n", + "\n", + "Let's download and extract the dataset, then explore the directory structure. If you run into path issues on Windows, use `dataset_dir = os.path.join(dataset, 'aclImdb')` instead." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "k7ZYnuajVlFN" + }, + "outputs": [], + "source": [ + "url = \"https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz\"\n", + "file_hash = \"c40f74a18d3b61f90feba1e17730e0d38e8b97c05fde7008942e91923d1658fe\"\n", + "\n", + "dataset = tf.keras.utils.get_file(fname=\"aclImdb_v1\", origin=url,\n", + " extract=True, cache_dir='.',\n", + " cache_subdir='', file_hash=file_hash, hash_algorithm='sha256')\n", + "\n", + "dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "355CfOvsV1pl" + }, + "outputs": [], + "source": [ + "os.listdir(dataset_dir)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "7ASND15oXpF1" + }, + "outputs": [], + "source": [ + "train_dir = os.path.join(dataset_dir, 'train')\n", + "os.listdir(train_dir)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ysMNMI1CWDFD" + }, + "source": [ + "The `aclImdb/train/pos` and `aclImdb/train/neg` directories contain many text files, each of which is a single movie review. Let's take a look at one of them."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "R7g8hFvzWLIZ" + }, + "outputs": [], + "source": [ + "sample_file = os.path.join(train_dir, 'pos/1181_9.txt')\n", + "with open(sample_file) as f:\n", + " print(f.read())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Mk20TEm6ZRFP" + }, + "source": [ + "### Load the dataset\n", + "\n", + "Next, you will load the data off disk and prepare it into a format suitable for training. To do so, you will use the helpful [text_dataset_from_directory](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text_dataset_from_directory) utility, which expects a directory structure as follows.\n", + "\n", + "```\n", + "main_directory/\n", + "...class_a/\n", + "......a_text_1.txt\n", + "......a_text_2.txt\n", + "...class_b/\n", + "......b_text_1.txt\n", + "......b_text_2.txt\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nQauv38Lnok3" + }, + "source": [ + "To prepare a dataset for binary classification, you will need two folders on disk, corresponding to `class_a` and `class_b`. These will be the positive and negative movie reviews, which can be found in `aclImdb/train/pos` and `aclImdb/train/neg`. As the IMDB dataset contains additional folders, you will remove them before using this utility." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VhejsClzaWfl" + }, + "outputs": [], + "source": [ + "remove_dir = os.path.join(train_dir, 'unsup')\n", + "shutil.rmtree(remove_dir)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "95kkUdRoaeMw" + }, + "source": [ + "Next, you will use the `text_dataset_from_directory` utility to create a labeled `tf.data.Dataset`. [tf.data](https://www.tensorflow.org/guide/data) is a powerful collection of tools for working with data. 
\n", + "\n", + "When running a machine learning experiment, it is a best practice to divide your dataset into three splits: [train](https://developers.google.com/machine-learning/glossary#training_set), [validation](https://developers.google.com/machine-learning/glossary#validation_set), and [test](https://developers.google.com/machine-learning/glossary#test-set). \n", + "\n", + "The IMDB dataset has already been divided into train and test, but it lacks a validation set. Let's create a validation set using an 80:20 split of the training data by using the `validation_split` argument below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "nOrK-MTYaw3C" + }, + "outputs": [], + "source": [ + "batch_size = 32\n", + "seed = 42\n", + "\n", + "raw_train_ds = tf.keras.utils.text_dataset_from_directory(\n", + " 'aclImdb/train', \n", + " batch_size=batch_size, \n", + " validation_split=0.2, \n", + " subset='training', \n", + " seed=seed)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5Y33oxOUpYkh" + }, + "source": [ + "As you can see above, there are 25,000 examples in the training folder, of which you will use 80% (or 20,000) for training. As you will see in a moment, you can train a model by passing a dataset directly to `model.fit`. If you're new to `tf.data`, you can also iterate over the dataset and print out a few examples as follows." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "51wNaPPApk1K" + }, + "outputs": [], + "source": [ + "for text_batch, label_batch in raw_train_ds.take(1):\n", + " for i in range(3):\n", + " print(\"Review\", text_batch.numpy()[i])\n", + " print(\"Label\", label_batch.numpy()[i])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JWq1SUIrp1a-" + }, + "source": [ + "Notice the reviews contain raw text (with punctuation and occasional HTML tags like `<br />`). You will see how to handle these in the following section. \n", + "\n", + "The labels are 0 or 1. To see which of these correspond to positive and negative movie reviews, you can check the `class_names` property on the dataset.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MlICTG8spyO2" + }, + "outputs": [], + "source": [ + "print(\"Label 0 corresponds to\", raw_train_ds.class_names[0])\n", + "print(\"Label 1 corresponds to\", raw_train_ds.class_names[1])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pbdO39vYqdJr" + }, + "source": [ + "Next, you will create a validation and test dataset. You will use the remaining 5,000 reviews from the training set for validation." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SzxazN8Hq1pF" + }, + "source": [ + "Note: When using the `validation_split` and `subset` arguments, make sure to either specify a random seed, or to pass `shuffle=False`, so that the validation and training splits have no overlap." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "JsMwwhOoqjKF" + }, + "outputs": [], + "source": [ + "raw_val_ds = tf.keras.utils.text_dataset_from_directory(\n", + " 'aclImdb/train', \n", + " batch_size=batch_size, \n", + " validation_split=0.2, \n", + " subset='validation', \n", + " seed=seed)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "rdSr0Nt3q_ns" + }, + "outputs": [], + "source": [ + "raw_test_ds = tf.keras.utils.text_dataset_from_directory(\n", + " 'aclImdb/test', \n", + " batch_size=batch_size)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qJmTiO0IYAjm" + }, + "source": [ + "### Prepare the dataset for training\n", + "\n", + "Next, you will standardize, tokenize, and vectorize the data using the helpful `tf.keras.layers.TextVectorization` layer. 
\n", + "\n", + "Standardization refers to preprocessing the text, typically to remove punctuation or HTML elements to simplify the dataset. Tokenization refers to splitting strings into tokens (for example, splitting a sentence into individual words, by splitting on whitespace). Vectorization refers to converting tokens into numbers so they can be fed into a neural network. All of these tasks can be accomplished with this layer.\n", + "\n", + "As you saw above, the reviews contain various HTML tags like `<br />`. These tags will not be removed by the default standardizer in the `TextVectorization` layer (which converts text to lowercase and strips punctuation by default, but doesn't strip HTML). You will write a custom standardization function to remove the HTML." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZVcHl-SLrH-u" + }, + "source": [ + "Note: To prevent [training-testing skew](https://developers.google.com/machine-learning/guides/rules-of-ml#training-serving_skew) (also known as training-serving skew), it is important to preprocess the data identically at train and test time. To facilitate this, the `TextVectorization` layer can be included directly inside your model, as shown later in this tutorial." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "SDRI_s_tX1Hk" + }, + "outputs": [], + "source": [ + "def custom_standardization(input_data):\n", + " lowercase = tf.strings.lower(input_data)\n", + " stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')\n", + " return tf.strings.regex_replace(stripped_html,\n", + " '[%s]' % re.escape(string.punctuation),\n", + " '')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d2d3Aw8dsUux" + }, + "source": [ + "Next, you will create a `TextVectorization` layer. You will use this layer to standardize, tokenize, and vectorize our data. You set the `output_mode` to `int` to create unique integer indices for each token.\n", + "\n", + "Note that you're using the default split function, and the custom standardization function you defined above. You'll also define some constants for the model, like an explicit maximum `sequence_length`, which will cause the layer to pad or truncate sequences to exactly `sequence_length` values." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "-c76RvSzsMnX" + }, + "outputs": [], + "source": [ + "max_features = 10000\n", + "sequence_length = 250\n", + "\n", + "vectorize_layer = layers.TextVectorization(\n", + " standardize=custom_standardization,\n", + " max_tokens=max_features,\n", + " output_mode='int',\n", + " output_sequence_length=sequence_length)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vlFOpfF6scT6" + }, + "source": [ + "Next, you will call `adapt` to fit the state of the preprocessing layer to the dataset. This will cause the model to build an index of strings to integers." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lAhdjK7AtroA" + }, + "source": [ + "Note: It's important to only use your training data when calling adapt (using the test set would leak information)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "GH4_2ZGJsa_X" + }, + "outputs": [], + "source": [ + "# Make a text-only dataset (without labels), then call adapt\n", + "train_text = raw_train_ds.map(lambda x, y: x)\n", + "vectorize_layer.adapt(train_text)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SHQVEFzNt-K_" + }, + "source": [ + "Let's create a function to see the result of using this layer to preprocess some data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "SCIg_T50wOCU" + }, + "outputs": [], + "source": [ + "def vectorize_text(text, label):\n", + " text = tf.expand_dims(text, -1)\n", + " return vectorize_layer(text), label" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "XULcm6B3xQIO" + }, + "outputs": [], + "source": [ + "# retrieve a batch (of 32 reviews and labels) from the dataset\n", + "text_batch, label_batch = next(iter(raw_train_ds))\n", + "first_review, first_label = text_batch[0], label_batch[0]\n", + "print(\"Review\", first_review)\n", + "print(\"Label\", raw_train_ds.class_names[first_label])\n", + "print(\"Vectorized review\", vectorize_text(first_review, first_label))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6u5EX0hxyNZT" + }, + "source": [ + "As you can see above, each token has been replaced by an integer. You can lookup the token (string) that each integer corresponds to by calling `.get_vocabulary()` on the layer." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "kRq9hTQzhVhW" + }, + "outputs": [], + "source": [ + "print(\"1287 ---> \",vectorize_layer.get_vocabulary()[1287])\n", + "print(\" 313 ---> \",vectorize_layer.get_vocabulary()[313])\n", + "print('Vocabulary size: {}'.format(len(vectorize_layer.get_vocabulary())))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XD2H6utRydGv" + }, + "source": [ + "You are nearly ready to train your model. As a final preprocessing step, you will apply the TextVectorization layer you created earlier to the train, validation, and test dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2zhmpeViI1iG" + }, + "outputs": [], + "source": [ + "train_ds = raw_train_ds.map(vectorize_text)\n", + "val_ds = raw_val_ds.map(vectorize_text)\n", + "test_ds = raw_test_ds.map(vectorize_text)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YsVQyPMizjuO" + }, + "source": [ + "### Configure the dataset for performance\n", + "\n", + "These are two important methods you should use when loading data to make sure that I/O does not become blocking.\n", + "\n", + "`.cache()` keeps data in memory after it's loaded off disk. This will ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.\n", + "\n", + "`.prefetch()` overlaps data preprocessing and model execution while training. \n", + "\n", + "You can learn more about both methods, as well as how to cache data to disk in the [data performance guide](https://www.tensorflow.org/guide/data_performance)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "wMcs_H7izm5m" + }, + "outputs": [], + "source": [ + "AUTOTUNE = tf.data.AUTOTUNE\n", + "\n", + "train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)\n", + "val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)\n", + "test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LLC02j2g-llC" + }, + "source": [ + "### Create the model\n", + "\n", + "It's time to create your neural network:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "dkQP6in8yUBR" + }, + "outputs": [], + "source": [ + "embedding_dim = 16" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "xpKOoWgu-llD" + }, + "outputs": [], + "source": [ + "model = tf.keras.Sequential([\n", + " layers.Embedding(max_features + 1, embedding_dim),\n", + " layers.Dropout(0.2),\n", + " layers.GlobalAveragePooling1D(),\n", + " layers.Dropout(0.2),\n", + " layers.Dense(1)])\n", + "\n", + "model.summary()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6PbKQ6mucuKL" + }, + "source": [ + "The layers are stacked sequentially to build the classifier:\n", + "\n", + "1. The first layer is an `Embedding` layer. This layer takes the integer-encoded reviews and looks up an embedding vector for each word-index. These vectors are learned as the model trains. The vectors add a dimension to the output array. The resulting dimensions are: `(batch, sequence, embedding)`. To learn more about embeddings, check out the [Word embeddings](https://www.tensorflow.org/text/guide/word_embeddings) tutorial.\n", + "2. Next, a `GlobalAveragePooling1D` layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model to handle input of variable length, in the simplest way possible.\n", + "3. 
The last layer is densely connected with a single output node." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "L4EqVWg4-llM" + }, + "source": [ + "### Loss function and optimizer\n", + "\n", + "A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs a single logit (a single-unit layer with no activation), you'll use the `losses.BinaryCrossentropy` loss function with `from_logits=True`.\n", + "\n", + "Now, configure the model to use an optimizer and a loss function:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Mr0GP-cQ-llN" + }, + "outputs": [], + "source": [ + "model.compile(loss=losses.BinaryCrossentropy(from_logits=True),\n", + " optimizer='adam',\n", + " metrics=[tf.metrics.BinaryAccuracy(threshold=0.0)])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "35jv_fzP-llU" + }, + "source": [ + "### Train the model\n", + "\n", + "You will train the model by passing the `dataset` object to the `fit` method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "tXSGrjWZ-llW" + }, + "outputs": [], + "source": [ + "epochs = 10\n", + "history = model.fit(\n", + " train_ds,\n", + " validation_data=val_ds,\n", + " epochs=epochs)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9EEGuDVuzb5r" + }, + "source": [ + "### Evaluate the model\n", + "\n", + "Let's see how the model performs. Two values will be returned: loss (a number that represents the error; lower values are better) and accuracy."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "zOMKywn4zReN" + }, + "outputs": [], + "source": [ + "loss, accuracy = model.evaluate(test_ds)\n", + "\n", + "print(\"Loss: \", loss)\n", + "print(\"Accuracy: \", accuracy)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "z1iEXVTR0Z2t" + }, + "source": [ + "This fairly naive approach achieves an accuracy of about 86%." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ldbQqCw2Xc1W" + }, + "source": [ + "### Create a plot of accuracy and loss over time\n", + "\n", + "`model.fit()` returns a `History` object that contains a dictionary with everything that happened during training:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "-YcvZsdvWfDf" + }, + "outputs": [], + "source": [ + "history_dict = history.history\n", + "history_dict.keys()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1_CH32qJXruI" + }, + "source": [ + "There are four entries: one for each monitored metric during training and validation. 
You can use these to plot the training and validation loss for comparison, as well as the training and validation accuracy:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2SEMeQ5YXs8z" + }, + "outputs": [], + "source": [ + "acc = history_dict['binary_accuracy']\n", + "val_acc = history_dict['val_binary_accuracy']\n", + "loss = history_dict['loss']\n", + "val_loss = history_dict['val_loss']\n", + "\n", + "epochs = range(1, len(acc) + 1)\n", + "\n", + "# \"bo\" is for \"blue dot\"\n", + "plt.plot(epochs, loss, 'bo', label='Training loss')\n", + "# b is for \"solid blue line\"\n", + "plt.plot(epochs, val_loss, 'b', label='Validation loss')\n", + "plt.title('Training and validation loss')\n", + "plt.xlabel('Epochs')\n", + "plt.ylabel('Loss')\n", + "plt.legend()\n", + "\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Z3PJemLPXwz_" + }, + "outputs": [], + "source": [ + "plt.plot(epochs, acc, 'bo', label='Training acc')\n", + "plt.plot(epochs, val_acc, 'b', label='Validation acc')\n", + "plt.title('Training and validation accuracy')\n", + "plt.xlabel('Epochs')\n", + "plt.ylabel('Accuracy')\n", + "plt.legend(loc='lower right')\n", + "\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hFFyCuJoXy7r" + }, + "source": [ + "In this plot, the dots represent the training loss and accuracy, and the solid lines are the validation loss and accuracy.\n", + "\n", + "Notice the training loss *decreases* with each epoch and the training accuracy *increases* with each epoch. This is expected when using a gradient descent optimization—it should minimize the desired quantity on every iteration.\n", + "\n", + "This isn't the case for the validation loss and accuracy—they seem to peak before the training accuracy. This is an example of overfitting: the model performs better on the training data than it does on data it has never seen before. 
After this point, the model over-optimizes and learns representations *specific* to the training data that do not *generalize* to test data.\n", + "\n", + "For this particular case, you could prevent overfitting by simply stopping the training when the validation accuracy is no longer increasing. One way to do so is to use the `tf.keras.callbacks.EarlyStopping` callback." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-to23J3Vy5d3" + }, + "source": [ + "## Export the model\n", + "\n", + "In the code above, you applied the `TextVectorization` layer to the dataset before feeding text to the model. If you want to make your model capable of processing raw strings (for example, to simplify deploying it), you can include the `TextVectorization` layer inside your model. To do so, you can create a new model using the weights you just trained." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "FWXsMvryuZuq" + }, + "outputs": [], + "source": [ + "export_model = tf.keras.Sequential([\n", + " vectorize_layer,\n", + " model,\n", + " layers.Activation('sigmoid')\n", + "])\n", + "\n", + "export_model.compile(\n", + " loss=losses.BinaryCrossentropy(from_logits=False), optimizer=\"adam\", metrics=['accuracy']\n", + ")\n", + "\n", + "# Test it with `raw_test_ds`, which yields raw strings\n", + "loss, accuracy = export_model.evaluate(raw_test_ds)\n", + "print(accuracy)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TwQgoN88LoEF" + }, + "source": [ + "### Inference on new data\n", + "\n", + "To get predictions for new examples, you can call `predict()` on the exported model."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "QW355HH5L49K" + }, + "outputs": [], + "source": [ + "examples = tf.constant([\n", + " \"The movie was great!\",\n", + " \"The movie was okay.\",\n", + " \"The movie was terrible...\"\n", + "])\n", + "\n", + "export_model.predict(examples)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MaxlpFWpzR6c" + }, + "source": [ + "Including the text preprocessing logic inside your model enables you to export a model for production that simplifies deployment, and reduces the potential for [training-serving skew](https://developers.google.com/machine-learning/guides/rules-of-ml#training-serving_skew).\n", + "\n", + "There is a performance difference to keep in mind when choosing where to apply your `TextVectorization` layer. Using it outside of your model enables you to do asynchronous CPU processing and buffering of your data when training on GPU. So, if you're training your model on the GPU, you probably want to go with this option to get the best performance while developing your model, then switch to including the `TextVectorization` layer inside your model when you're ready to prepare for deployment.\n", + "\n", + "Visit this [tutorial](https://www.tensorflow.org/tutorials/keras/save_and_load) to learn more about saving models." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eSSuci_6nCEG" + }, + "source": [ + "## Exercise: multi-class classification on Stack Overflow questions\n", + "\n", + "This tutorial showed how to train a binary classifier from scratch on the IMDB dataset. 
As an exercise, you can modify this notebook to train a multi-class classifier to predict the tag of a programming question on [Stack Overflow](http://stackoverflow.com/).\n", + "\n", + "A [dataset](https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz) has been prepared for you to use containing the body of several thousand programming questions (for example, \"How can I sort a dictionary by value in Python?\") posted to Stack Overflow. Each of these is labeled with exactly one tag (either Python, CSharp, JavaScript, or Java). Your task is to take a question as input, and predict the appropriate tag, in this case, Python. \n", + "\n", + "The dataset you will work with contains several thousand questions extracted from the much larger public Stack Overflow dataset on [BigQuery](https://console.cloud.google.com/marketplace/details/stack-exchange/stack-overflow), which contains more than 17 million posts.\n", + "\n", + "After downloading the dataset, you will find it has a similar directory structure to the IMDB dataset you worked with previously:\n", + "\n", + "```\n", + "train/\n", + "...python/\n", + "......0.txt\n", + "......1.txt\n", + "...javascript/\n", + "......0.txt\n", + "......1.txt\n", + "...csharp/\n", + "......0.txt\n", + "......1.txt\n", + "...java/\n", + "......0.txt\n", + "......1.txt\n", + "```\n", + "\n", + "Note: To increase the difficulty of the classification problem, occurrences of the words Python, CSharp, JavaScript, or Java in the programming questions have been replaced with the word *blank* (as many questions contain the language they're about).\n", + "\n", + "To complete this exercise, you should modify this notebook to work with the Stack Overflow dataset by making the following modifications:\n", + "\n", + "1. 
At the top of your notebook, replace the code that downloads the IMDB dataset with code to download the [Stack Overflow dataset](https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz) that has already been prepared. As the Stack Overflow dataset has a similar directory structure, you will not need to make many modifications.\n", + "\n", + "1. Modify the last layer of your model to `Dense(4)`, as there are now four output classes.\n", + "\n", + "1. When compiling the model, change the loss to `tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)`. This is the correct loss function to use for a multi-class classification problem, when the labels for each class are integers (in this case, they can be *0*, *1*, *2*, or *3*). In addition, change the metrics to `metrics=['accuracy']`, since this is a multi-class classification problem (`tf.metrics.BinaryAccuracy` is only used for binary classifiers).\n", + "\n", + "1. When plotting accuracy over time, change `binary_accuracy` and `val_binary_accuracy` to `accuracy` and `val_accuracy`, respectively.\n", + "\n", + "1. Once these changes are complete, you will be able to train a multi-class classifier. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F0T5SIwSm7uc" + }, + "source": [ + "## Learning more\n", + "\n", + "This tutorial introduced text classification from scratch. To learn more about the text classification workflow in general, check out the [Text classification guide](https://developers.google.com/machine-learning/guides/text-classification/) from Google Developers.\n" + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "collapsed_sections": [], + "name": "text_classification.ipynb", + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 0 }