MIT 6.830 Lab3

Project Intro

Lab 3需要实现一个基于cost的查询优化模块，其主要作用是在SimpleDB处理Join等SQL语句时可以对输入的SQL查询进行优化；这里的cost会根据SimpleDB数据表中的统计信息计算得出，统计信息则来源于之前Lab中实现的存储模块。

整个查询优化模块的架构如下：

首先需要实现TableStats这个类来计算每个数据表的统计信息，然后进一步实现Join操作的查询估计和嵌套Join的优化这两个核心功能。

数据表统计信息计算（Exercise 1-2）

首先要实现统计直方图类IntHistogram和StringHistogram，用来统计某一列的数据分布情况，然后用每个列的统计数据组成整个数据表的统计信息，即TableStats。

直方图类的实现

实现思路比较直观，假设某个属性（Field）的值是离散分布的，统计出该Field的各值所对应的元组个数，并形成一张直方图，如下：

然后，在执行Join操作的时候，通常会用Filter对元组进行过滤，Filter本身又是可以两两分组的（大于和小于等于，小于和大于等于，等于和不等于），每一组的数据比例总和都为1。因此只需要实现一半的Filter统计即可。

每种Filter的占比对应的面积是不同的，比如大于是分界点右边的面积，小于则是左边的面积，而等于则是在分界点所在的这一块里面用平均值来估算，比如一段对应的值有3，4，5三个，而这一段区域对应的元组数是30，那么在估计该Field值为4的元组数量的时候，就是30/3=10个。

实现一下estimateSelectivity(Predicate.Op op, int v) 这个方法就可以了，对每种不同的Op算子都计算对应的结果。具体实现时仍然把6个op算子都写了一遍。

public double estimateSelectivity(Predicate.Op op, int v) {
    // some code goes here
    int i = binarySearch(v);
    Gram gram;
    gram = i != -1 ? grams[i] : null;
    if (op == Predicate.Op.EQUALS){
        if (gram != null){
            return (gram.h / gram.w) / (double) ntups;
        } else {
            return 0.0;
        }
    } else if (op == Predicate.Op.GREATER_THAN){
        if (v < min){
            return 1.0;
        } else if (v >= max){
            return 0.0;
        } else if (gram != null){
            double res = 0;
            double add = ((double) gram.h / (double) ntups) * ((gram.right - v) / gram.w);
            res += add;
            int j;
            for (j = i+1; j<buckets; j++){
                res += (double) grams[j].h / (double) ntups;
            }
            return res;
        }
    } else if (op == Predicate.Op.LESS_THAN){
        if (v <= min){
            return 0.0;
        } else if (v >= max){
            return 1.0;
        } else if (gram != null){
            double res = 0;
            double add = ((double) gram.h / (double) ntups * (v - gram.left) / gram.w);
            res += add;
            int j;
            for (j = 0; j<i; j++){
                res += (double) grams[j].h / (double) ntups;
            }
            return res;
        }
    } else if (op == Predicate.Op.NOT_EQUALS){
        if (gram != null){
            return 1 - (gram.h / gram.w / (double) ntups);
        } else {
            return 1.0;
        }
    } else if (op == Predicate.Op.GREATER_THAN_OR_EQ){
        if (v <= min){
            return 1.0;
        } else if (v > max){
            return 0.0;
        } else if (gram != null){
            double res = 0;
            double add = ((double) gram.h * ((gram.right - v + 1) / gram.w)) / (double) ntups;
            res += add;
            int j;
            for (j = i+1; j<buckets; j++){
                res += (double) grams[j].h / (double) ntups;
            }
            return res;
        }
    } else if (op == Predicate.Op.LESS_THAN_OR_EQ){
        if (v < min){
            return 0.0;
        } else if (v >= max){
            return 1.0;
        } else if (gram != null){
            double res = 0;
            double add = ((double) gram.h / (double) ntups) * ((v - gram.left + 1) / gram.w);
            res += add;
            int j;
            for (j=0; j<i; j++){
                res += (double) grams[j].h / (double) ntups;
            }
            return res;
        }
    }
    return 0.0;
}

TableStats实现

接下来就需要实现整个数据表的统计信息类TableStats，即给每个Field都建立一个直方图，并维护这些统计信息。

具体实现时先根据TupleDesc初始化一系列直方图，然后调用Heap File的iterator对整个表进行遍历，往直方图里填充数据。

Join查询估计（Exercise 3）

计算每次Join的Cost和Join产生的元组数量，实现的方法分别是estimateJoinCost和estimateJoinCardinality

estimateJoinCost用来估计一次Join操作所需要的Cost，给定一个Join操作左右两边参与的元组数量以及每次操作的cost，来计算整个Join执行一次所需要的cost。

public double estimateJoinCost(LogicalJoinNode j, int card1, int card2,
                               double cost1, double cost2) {
    if (j instanceof LogicalSubplanJoinNode) {
        // A LogicalSubplanJoinNode represents a subquery.
        // You do not need to implement proper support for these for Lab 3.
        return card1 + cost1 + cost2;
    } else {
        // Insert your code here.
        // HINT: You may need to use the variable "j" if you implemented
        // a join algorithm that's more complicated than a basic
        // nested-loops join.
        return cost1 + card1 * cost2 + card1 * card2;
    }
}

estimateJoinCardinality 则被用来估计Join操作产生的元组数量，这个方法实际上和前面实现的数据表统计信息有关，但是相关的调用已经给出了，因此也不需要过多写了。

public int estimateJoinCardinality(LogicalJoinNode j, int card1, int card2,
                                   boolean t1pkey, boolean t2pkey, Map<String, TableStats> stats) {
    if (j instanceof LogicalSubplanJoinNode) {
        // A LogicalSubplanJoinNode represents a subquery.
        // You do not need to implement proper support for these for Lab 3.
        return card1;
    } else {
        return estimateTableJoinCardinality(j.p, j.t1Alias, j.t2Alias,
                j.f1PureName, j.f2PureName, card1, card2, t1pkey, t2pkey,
                stats, p.getTableAliasToIdMapping());
    }
}

嵌套Join优化（Exercise 4）

最后要实现Join的优化，实现的想法为：

暴力遍历所有可能的Join排列，然后分别估计它们的总Cost，然后选出总Cost最小的一组作为查询优化的结果。SimpleDB已经提供了一个方法enumerateSubsets来实现子集的搜索，因此只要调用这个方法，不需要再写一个对Join进行搜索的过程。

在搜索的过程中，使用一个SimpleDB提供的类PlanCache来存储搜索过程中的中间数据，PlanCache会自动排序找出最优的Join顺序，因此只需要在最后调用它的接口返回正确结果就可以了。

public List<LogicalJoinNode> orderJoins(
        Map<String, TableStats> stats,
        Map<String, Double> filterSelectivities, boolean explain)
        throws ParsingException {
    // Not necessary for labs 1 and 2.

    // some code goes here
    int size = joins.size();
    PlanCache planCache = new PlanCache();
    CostCard bestCostCard = null;
    for (int i=1; i<=size; i++){
        for (Set<LogicalJoinNode> s : enumerateSubsets(joins, i)){
            double bestCost = Double.MAX_VALUE;
            bestCostCard = new CostCard();
            for (LogicalJoinNode logicalJoinNode : s){
                CostCard costCard1 = computeCostAndCardOfSubplan(stats, filterSelectivities, logicalJoinNode, s, bestCost, planCache);
                if (costCard1 == null){
                    continue;
                }
                if (costCard1.cost < bestCost){
                    bestCost = costCard1.cost;
                    bestCostCard = costCard1;
                }
            }
            planCache.addPlan(s, bestCost, bestCostCard.card, bestCostCard.plan);
        }
    }
    if (explain){
        assert bestCostCard != null;
        printJoins(bestCostCard.plan, planCache, stats, filterSelectivities);
    }
    assert bestCostCard != null;
    return bestCostCard.plan;
}