<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Extracting data from obfuscated java code | Blog | Felipe Martin</title>
<link rel="stylesheet" href="/static/css/style.css">
<link rel="alternate" type="application/rss+xml" title="RSS Feed for fmartingr.com" href="/feed.xml" />
<link rel="icon" href="/static/images/favicon.ico">
<!-- Mobile -->
<meta name="HandheldFriendly" content="True">
<meta name="MobileOptimized" content="320">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=0">
<meta http-equiv="cleartype" content="on">
</head>
<body class="blog post">
<div class="page-content center">
<header>
<div class="avatar">
<img class="avatar" src="/static/images/avatar.jpg?h=f834fb12">
</div>
<h1>Felipe Martín</h1>
<nav>
<a href="/">/home</a>
<a class="text-bold" href="/blog/">/blog</a>
<a href="/about/">/about</a>
</nav>
</header>
<hr>
<section class="main-content">
<article class="blog-post">
<h1 class="title"><a href="/blog/2013/07/04/extracting-data-from-obfuscated-java-code/">Extracting data from obfuscated java code</a></h1>
<div class="info">
Published on July 04, 2013
</div>

<div class="content">

<p><img src="/blog/2013/07/04/extracting-data-from-obfuscated-java-code/header.png" alt=""></p>
<p>For those who don't know, a while ago I started a Minecraft-related site (yes,
<a href="/blog/2013/1/12/weekly-project-status-dropping-projects-hard/">the one I dropped</a>). If you don't know what Minecraft is (really?!), you can check <a href="http://minecraft.net/">the
official site</a>, since the game can be a little
difficult to explain.</p>
<p>The project (which is online at
<a href="http://www.minecraftcodex.com">minecraftcodex.com</a>) is just a database of
items, blocks, entities, etc. related to the game, but as with any other site of
this kind, entering all this information by hand leads to absolute <em>boredom</em>. So
I thought... what if I could extract some of the data from the game's
<em>class files</em>? That would be awesome! <em>Spoiler alert:</em> I did it.</p>
<blockquote><p>Think of this as my personal approach to the steps below: it doesn't mean they're the best solutions.</p>
</blockquote>
<h2 id="unpackaging-the-jarfile-and-decompiling-the-classes">Unpacking the jarfile and decompiling the classes</h2><p>First of all, you have a <em>minecraft.jar</em> file, which is just a packaged set of
compiled Java files (a ZIP archive), so you can <code>unzip</code> it into a folder (<code>tar -xf</code> also works where <code>tar</code> is bsdtar):</p>
<div class="hll"><pre><span></span>unzip -qq minecraft.jar -d ./jarfile
</pre></div>
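<p>The same step can be sketched in Python with the standard <code>zipfile</code> module. This is only an illustration, not part of the original tooling: the helper name <code>unpack_jar</code> and the tiny stand-in jar are assumptions made so the example runs without the real game file.</p>

```python
import tempfile
import zipfile
from pathlib import Path


def unpack_jar(jar_path, target_dir):
    """Extract every entry of a jar (a ZIP archive) and return the .class names."""
    with zipfile.ZipFile(jar_path) as jar:
        jar.extractall(target_dir)
        return sorted(n for n in jar.namelist() if n.endswith(".class"))


# Build a tiny stand-in jar so the sketch runs without the real game file.
workdir = Path(tempfile.mkdtemp())
fake_jar = workdir / "fake.jar"
with zipfile.ZipFile(fake_jar, "w") as jar:
    jar.writestr("aea.class", b"\xca\xfe\xba\xbe")
    jar.writestr("title/splashes.txt", "Hello")

class_names = unpack_jar(fake_jar, workdir / "jarfile")
print(class_names)
```

<p>Pointing <code>unpack_jar</code> at the real <em>minecraft.jar</em> would fill the target folder the same way the <code>unzip</code> call does.</p>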
<p>With this we now have a folder called <em>jarfile</em> filled with all the jar
contents. We now need a tool to decompile all the compiled files into
.java files, because the data we're looking for is hard-coded into the
source. For this purpose we're going to use <a href="http://varaneckas.com/jad/">JAD</a>,
a Java decompiler. With a single line of <em>bash</em> we can look for all the .class
files and decompile them into .java source code:</p>
<div class="hll"><pre><span></span>ls ./jarfile/*.class | xargs -n1 jad -sjava -dclasses &amp;&gt; /dev/null
</pre></div>
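<p>If you would rather drive the decompiler from Python, the one-liner above could be mirrored roughly like this. It is a sketch under assumptions: <code>jad_commands</code> is a made-up helper, the flags are simply the ones used above, and the commands are only built and printed so the example runs without JAD installed.</p>

```python
import tempfile
from pathlib import Path


def jad_commands(class_dir, out_dir="classes"):
    """Build one jad invocation per .class file, mirroring the xargs one-liner."""
    return [
        ["jad", "-sjava", "-d" + out_dir, str(path)]
        for path in sorted(Path(class_dir).glob("*.class"))
    ]


# Stand-in directory with two empty class files, just to show the commands.
demo_dir = Path(tempfile.mkdtemp())
for name in ("aea.class", "aeb.class", "readme.txt"):
    (demo_dir / name).touch()

commands = jad_commands(demo_dir)
for command in commands:
    print(" ".join(command))
    # With jad installed, each command could be passed to subprocess.run(),
    # sending stdout and stderr to subprocess.DEVNULL.
```

<p>Like the shell version, this only looks at the top level of the folder; <code>Path.rglob</code> would be the way to descend into subdirectories.</p>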
<p>All the class files have been decompiled and, for ease of use, the output has been written
into a separate directory (<em>classes</em>). But there are a lot of files! And when we open
one...</p>
<div class="hll"><pre><span></span><span class="kd">public</span> <span class="kd">class</span> <span class="nc">aea</span> <span class="kd">extends</span> <span class="n">aeb</span>
<span class="p">{</span>
    <span class="kd">public</span> <span class="nf">aea</span><span class="p">()</span>
    <span class="p">{</span>
    <span class="p">}</span>

    <span class="kd">protected</span> <span class="kt">void</span> <span class="nf">a</span><span class="p">(</span><span class="kt">long</span> <span class="n">l</span><span class="p">,</span> <span class="kt">int</span> <span class="n">i</span><span class="p">,</span> <span class="kt">int</span> <span class="n">j</span><span class="p">,</span> <span class="kt">byte</span> <span class="n">abyte0</span><span class="o">[]</span><span class="p">,</span> <span class="kt">double</span> <span class="n">d</span><span class="p">,</span>
            <span class="kt">double</span> <span class="n">d1</span><span class="p">,</span> <span class="kt">double</span> <span class="n">d2</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="n">a</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">,</span> <span class="n">abyte0</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">d1</span><span class="p">,</span> <span class="n">d2</span><span class="p">,</span> <span class="mf">1.0F</span> <span class="o">+</span> <span class="n">b</span><span class="p">.</span><span class="na">nextFloat</span><span class="p">()</span> <span class="o">*</span> <span class="mf">6F</span><span class="p">,</span> <span class="mf">0.0F</span><span class="p">,</span> <span class="mf">0.0F</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mf">0.5D</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="c1">// ...</span>
<span class="p">}</span>
</pre></div>
<p>Look at that beautiful obfuscated piece of code! This is getting more
interesting at every step: almost 1,600 Java files with obfuscated source
code.</p>
<h2 id="searching-for-the-data">Searching for the data</h2><p>I took the following approach: since I know what I'm looking for (blocks,
items, etc.) and I also know that the information is hard-coded into the
source, there must be some kind of string I can use to search all the files
and get only the ones that contain the pieces of information I'm looking for. For
this test, I used the string "diamond":</p>
<div class="hll"><pre><span></span>$ grep diamond ./classes/*
./classes/bfp.java: "cloth", "chain", "iron", "diamond", "gold"
./classes/bge.java: "cloth", "chain", "iron", "diamond", "gold"
./classes/kd.java: w = (new kc(17, "diamonds", -1, 5, xn.p, k)).c();
./classes/rf.java: null, "mob/horse/armor_metal.png", "mob/horse/armor_gold.png", "mob/horse/armor_diamond.png"
./classes/xn.java: p = (new xn(8)).b("diamond").a(wh.l);
./classes/xn.java: cg = (new xn(163)).b("horsearmordiamond").d(1).a(wh.f);
</pre></div>
<p>As you can see, with a simple word we've filtered down to five files (from
1,521 in this test). This is proof that we can get some information from the source
code, and now we need to filter even more. Looking around some of the files, I selected
another keyword: <em>flintAndSteel</em>. It works great here, but in a real example you
will need to use more than one keyword to look for data.</p>
<div class="hll"><pre><span></span>$ grep flintAndSteel ./classes/*
./classes/xn.java: public static xn k = (new xh(3)).b("flintAndSteel");
</pre></div>
<p>Only one file now, so we're going to assume that all the items are listed there
and proceed to extract the information.</p>
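<p>Combining several keywords is easier in a script than with chained <code>grep</code> calls. A minimal sketch, assuming a hypothetical helper <code>files_with_all</code> and two stand-in files modelled on the output above:</p>

```python
import tempfile
from pathlib import Path


def files_with_all(directory, keywords):
    """Return the names of .java files whose text contains every keyword."""
    matches = []
    for path in sorted(Path(directory).glob("*.java")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if all(keyword in text for keyword in keywords):
            matches.append(path.name)
    return matches


# Stand-in sources modelled on the grep output above.
src = Path(tempfile.mkdtemp())
(src / "xn.java").write_text(
    'p = (new xn(8)).b("diamond").a(wh.l);\n'
    'k = (new xh(3)).b("flintAndSteel");\n'
)
(src / "rf.java").write_text('"mob/horse/armor_diamond.png"\n')

one_keyword = files_with_all(src, ["diamond"])
two_keywords = files_with_all(src, ["diamond", "flintAndSteel"])
print(one_keyword)
print(two_keywords)
```

<p>With one keyword both stand-in files match; requiring both keywords leaves only <code>xn.java</code>, which is the narrowing effect described above.</p>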
<h2 id="parsing-the-items">Parsing the items</h2><p>This was the most complicated part. I started writing some regular
expressions to match the values I wanted to extract, but that soon became
inefficient due to:</p>
<ul>
<li>The obfuscated code varies with every released version/snapshot (or it should).</li>
<li>The use of OOP made it difficult to find methods with RegEx matching, since the names can change from version to version, making the tool unusable after updates.</li>
<li>The need to modify the RegEx if something in the code changes, or if we want to extract some other value.</li>
</ul>
<p>After some tests, I decided to <em>convert</em> the Java code into Python. For that,
I used a simple find and match to get the lines that had the definitions I
wanted, something like this:</p>
<div class="hll"><pre><span></span><span class="c1">// As a first simple filter, we only use a code line if a double quote is found on it.</span>
<span class="c1">// Then, regex: /new (?P&lt;code&gt;[a-z]{2}\((?P&lt;id&gt;[0-9]{1,3}).*\"(?P&lt;name&gt;\w+)\"\))/</span>
<span class="c1">// ...</span>
<span class="n">T</span> <span class="o">=</span> <span class="p">(</span><span class="k">new</span> <span class="n">xm</span><span class="p">(</span><span class="mi">38</span><span class="p">,</span> <span class="n">xo</span><span class="p">.</span><span class="na">e</span><span class="p">)).</span><span class="na">b</span><span class="p">(</span><span class="s">"hoeGold"</span><span class="p">);</span>
<span class="n">U</span> <span class="o">=</span> <span class="p">(</span><span class="k">new</span> <span class="n">yi</span><span class="p">(</span><span class="mi">39</span><span class="p">,</span> <span class="n">aqh</span><span class="p">.</span><span class="na">aD</span><span class="p">.</span><span class="na">cE</span><span class="p">,</span> <span class="n">aqh</span><span class="p">.</span><span class="na">aE</span><span class="p">.</span><span class="na">cE</span><span class="p">)).</span><span class="na">b</span><span class="p">(</span><span class="s">"seeds"</span><span class="p">);</span>
<span class="n">V</span> <span class="o">=</span> <span class="p">(</span><span class="k">new</span> <span class="n">xn</span><span class="p">(</span><span class="mi">40</span><span class="p">)).</span><span class="na">b</span><span class="p">(</span><span class="s">"wheat"</span><span class="p">).</span><span class="na">a</span><span class="p">(</span><span class="n">wh</span><span class="p">.</span><span class="na">l</span><span class="p">);</span>
<span class="n">X</span> <span class="o">=</span> <span class="p">(</span><span class="n">vr</span><span class="p">)(</span><span class="k">new</span> <span class="n">vr</span><span class="p">(</span><span class="mi">42</span><span class="p">,</span> <span class="n">vt</span><span class="p">.</span><span class="na">a</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">)).</span><span class="na">b</span><span class="p">(</span><span class="s">"helmetCloth"</span><span class="p">);</span>
<span class="n">Y</span> <span class="o">=</span> <span class="p">(</span><span class="n">vr</span><span class="p">)(</span><span class="k">new</span> <span class="n">vr</span><span class="p">(</span><span class="mi">43</span><span class="p">,</span> <span class="n">vt</span><span class="p">.</span><span class="na">a</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)).</span><span class="na">b</span><span class="p">(</span><span class="s">"chestplateCloth"</span><span class="p">);</span>
<span class="c1">// ...</span>
</pre></div>
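<p>To make the filtering concrete, here is a small sketch of that two-step match applied to a few of the lines above. The variable names are illustrative, and the groups are left unnamed to keep the pattern short.</p>

```python
import re

# Same shape as the pattern in the comment above, with unnamed groups; the id
# is matched with [0-9] so that ids such as 40 or 109 are captured whole.
ITEM_RE = re.compile(r'new ([a-z]{2}\(([0-9]{1,3}).*"(\w+)"\))')

lines = [
    'T = (new xm(38, xo.e)).b("hoeGold");',
    'V = (new xn(40)).b("wheat").a(wh.l);',
    'X = (vr)(new vr(42, vt.a, 0, 0)).b("helmetCloth");',
    'int i = 40;',
]

items = []
for line in lines:
    if '"' not in line:
        continue  # first filter: no double quote, no item definition
    match = ITEM_RE.search(line)
    if match:
        code, item_id, name = match.groups()
        items.append((int(item_id), name, code))

for item in items:
    print(item)
```

<p>The third element of each tuple is the raw Java fragment starting at the class name, which is the piece that gets converted in the next step.</p>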
<p>Since that Java code can't be evaluated by Python as-is, convert it:</p>
<ul>
<li>Remove unmatched parentheses and double definitions</li>
<li>Remove semicolons</li>
<li>Remove variable definitions</li>
<li>Convert arguments to strings. This can be improved a lot: keeping decimals, converting floats to Python notation, detecting words for string conversion, etc. Since for now I am not using any of the extra parameters, this works for me.</li>
<li>Be careful with Python keywords and built-in names! (<code>and</code>, <code>all</code>, <code>abs</code>, ...)</li>
</ul>
<div class="hll"><pre><span></span><span class="c1"># Java: U = (new yi(39, aqh.aD.cE, aqh.aE.cE)).b("seeds");</span>
<span class="n">yi</span><span class="p">(</span><span class="s2">"39"</span><span class="p">,</span> <span class="s2">"aqh.aD.cE"</span><span class="p">,</span> <span class="s2">"aqh.aE.cE"</span><span class="p">)</span><span class="o">.</span><span class="n">b</span><span class="p">(</span><span class="s2">"seeds"</span><span class="p">)</span>
<span class="c1"># Java: bm = (new xi(109, 2, 0.3F, true)).a(mv.s.H, 30, 0, 0.3F).b("chickenRaw");</span>
<span class="n">xi</span><span class="p">(</span><span class="s2">"109"</span><span class="p">,</span> <span class="s2">"2"</span><span class="p">,</span> <span class="s2">"0.3F"</span><span class="p">,</span> <span class="s2">"true"</span><span class="p">)</span><span class="o">.</span><span class="n">a</span><span class="p">(</span><span class="s2">"mv.s.H"</span><span class="p">,</span> <span class="s2">"30"</span><span class="p">,</span> <span class="s2">"0"</span><span class="p">,</span> <span class="s2">"0.3F"</span><span class="p">)</span><span class="o">.</span><span class="n">b</span><span class="p">(</span><span class="s2">"chickenRaw"</span><span class="p">)</span>
</pre></div>
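<p>The conversion steps listed above could be sketched as follows. This is not the code used on the site: <code>java_to_python</code> and <code>quote_args</code> are hypothetical helpers, they assume definitions shaped like the examples shown here (no nested calls, no commas inside strings), and they do nothing about class names that clash with Python keywords.</p>

```python
import re


def quote_args(arguments):
    """Wrap every unquoted argument in double quotes, keeping quoted ones."""
    quoted = []
    for argument in arguments.split(","):
        argument = argument.strip()
        if argument and not argument.startswith('"'):
            argument = '"%s"' % argument
        quoted.append(argument)
    return ", ".join(quoted)


def java_to_python(line):
    """Turn one Java item definition into an expression Python can evaluate."""
    # Drop the assignment target and the trailing semicolon.
    code = line.split("=", 1)[1].strip().rstrip(";")
    # Drop everything up to "new ", which also removes casts such as "(vr)".
    code = code[code.index("new ") + len("new "):]
    # "(new yi(...)).b(...)" leaves one unmatched ")" after the constructor.
    code = code.replace(")).", ").", 1)
    # Quote the arguments of every call in the chain.
    return re.sub(r"\(([^()]*)\)", lambda m: "(" + quote_args(m.group(1)) + ")", code)


seeds = java_to_python('U = (new yi(39, aqh.aD.cE, aqh.aE.cE)).b("seeds");')
chicken = java_to_python(
    'bm = (new xi(109, 2, 0.3F, true)).a(mv.s.H, 30, 0, 0.3F).b("chickenRaw");'
)
print(seeds)
print(chicken)
```

<p>Quoting every argument keeps the result trivially evaluable, at the cost of losing the numeric types, which matches the trade-off described in the list above.</p>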
<p>Then I defined a class to stand in for the Java classes when
evaluating:</p>
<div class="hll"><pre><span></span><span class="k">class</span> <span class="nc">GameItem</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">game_id</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">id</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">game_id</span><span class="p">)</span>
        <span class="c1"># Default, so __str__ works even if no name was ever set</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="kc">None</span>

    <span class="k">def</span> <span class="fm">__str__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="s2">"&lt;Item(</span><span class="si">%d</span><span class="s2">: '</span><span class="si">%s</span><span class="s2">')&gt;"</span> <span class="o">%</span> <span class="p">(</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">id</span><span class="p">,</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">name</span>
        <span class="p">)</span>

    <span class="k">def</span> <span class="nf">method</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">):</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">args</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span> <span class="ow">and</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">args</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="nb">str</span><span class="p">):</span>
            <span class="c1"># A single string argument sets the name</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">args</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
        <span class="k">return</span> <span class="bp">self</span>

    <span class="k">def</span> <span class="fm">__getattr__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">attribute</span><span class="p">):</span>
        <span class="c1"># Catch-all: any unknown (obfuscated) method resolves to `method`</span>
        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">method</span>
</pre></div>
<p>As you can see, this class has a global "catch-all" method: since we don't
know the obfuscated Java names, that function handles every call. In this
particular class, we know that a method called with a single string parameter is
the one that defines the item's name, so we store it in our model.</p>
<p>Now, evaluating a line of code will raise an exception saying that
the class name <em>&lt;insert obfuscated class name here&gt;</em> is not defined.
With that, we declare that name as a subclass of GameItem, so
evaluating the code again returns a GameItem object:</p>
<div class="hll"><pre><span></span><span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">sys</span>

<span class="c1"># Matches the message of the NameError raised for an unknown class name</span>
<span class="n">class_error_regex</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="sa">r</span><span class="s2">"name '(?P&lt;name&gt;\w+)' is not defined"</span><span class="p">)</span>

<span class="k">try</span><span class="p">:</span>
    <span class="c1"># Try to evaluate the piece of code that we converted</span>
    <span class="n">obj</span> <span class="o">=</span> <span class="nb">eval</span><span class="p">(</span><span class="n">item</span><span class="p">[</span><span class="s1">'code'</span><span class="p">])</span>
<span class="k">except</span> <span class="ne">NameError</span> <span class="k">as</span> <span class="n">error</span><span class="p">:</span>
    <span class="c1"># The class name does not exist yet! We need to define it.</span>
    <span class="c1"># Extract the class name from the error message</span>
    <span class="n">class_name</span> <span class="o">=</span> <span class="n">class_error_regex</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">error</span><span class="p">))</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="s1">'name'</span><span class="p">)</span>
    <span class="c1"># Define the class name as a subclass of GameItem</span>
    <span class="nb">setattr</span><span class="p">(</span><span class="n">sys</span><span class="o">.</span><span class="n">modules</span><span class="p">[</span><span class="vm">__name__</span><span class="p">],</span> <span class="n">class_name</span><span class="p">,</span> <span class="nb">type</span><span class="p">(</span><span class="n">class_name</span><span class="p">,</span> <span class="p">(</span><span class="n">GameItem</span><span class="p">,),</span> <span class="p">{}))</span>
    <span class="c1"># Evaluate again to get the object</span>
    <span class="n">obj</span> <span class="o">=</span> <span class="nb">eval</span><span class="p">(</span><span class="n">item</span><span class="p">[</span><span class="s1">'code'</span><span class="p">])</span>
</pre></div>
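<p>Putting the pieces together, a self-contained variant of the idea might look like the sketch below. It differs from the snippet above in two labelled ways: the generated classes live in an explicit <code>namespace</code> dictionary passed to <code>eval</code> instead of being set on the module, and the retry is a loop. The names <code>evaluate</code> and <code>_any_call</code> are assumptions made for the example.</p>

```python
import re

NAME_ERROR_RE = re.compile(r"name '(\w+)' is not defined")


class GameItem:
    """Stand-in for any obfuscated item class."""

    def __init__(self, game_id, *args):
        self.id = int(game_id)
        self.name = None

    def _any_call(self, *args):
        # A call with a single string argument is assumed to set the name.
        if len(args) == 1 and isinstance(args[0], str):
            self.name = args[0]
        return self

    def __getattr__(self, attribute):
        # Only reached for attributes that do not exist, i.e. obfuscated methods.
        return self._any_call


def evaluate(code, namespace):
    """Evaluate converted code, defining unknown class names on demand."""
    while True:
        try:
            return eval(code, namespace)
        except NameError as error:
            class_name = NAME_ERROR_RE.search(str(error)).group(1)
            namespace[class_name] = type(class_name, (GameItem,), {})


namespace = {}
item = evaluate(
    'xi("109", "2", "0.3F", "true").a("mv.s.H", "30", "0", "0.3F").b("chickenRaw")',
    namespace,
)
print(item.id, item.name, type(item).__name__)
```

<p>Keeping the classes in a dictionary means the same namespace can be reused for every line of a file, so each obfuscated class name is only defined once.</p>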
<p>And with this, getting data from the source code was possible and really helpful.</p>
<p>A lot could be improved here to get even more information from
the classes: after spending lots of time looking for certain patterns
in the code, I can tell what some/most of the parameters mean, and that means
more automation on new releases!</p>
<h2 id="real-use-case">Real use case</h2><p>Apart from getting the base data for the site (all the data shown on Minecraft
Codex is mined directly from the source code), I built a tool that shows
the changes since the last comparison, if any. This way I can easily discover what
the awesome Mojang team added to the game in every snapshot they release:</p>
<p><img src="/blog/2013/07/04/extracting-data-from-obfuscated-java-code/diff.png" alt=""></p>
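<p>The tool itself is not shown here, but the core of such a comparison can be sketched in a few lines. Everything in this example is hypothetical: the <code>diff_items</code> helper, the <code>{id: name}</code> shape of a snapshot, and the sample data (the rename to <code>wheatCrop</code> is invented for illustration).</p>

```python
def diff_items(old, new):
    """Compare two {id: name} mappings from consecutive game versions."""
    added = {i: new[i] for i in new if i not in old}
    removed = {i: old[i] for i in old if i not in new}
    renamed = {
        i: (old[i], new[i])
        for i in old
        if i in new and old[i] != new[i]
    }
    return added, removed, renamed


# Made-up snapshots; only the structure matters.
previous = {8: "diamond", 40: "wheat", 109: "chickenRaw"}
current = {8: "diamond", 40: "wheatCrop", 163: "horsearmordiamond"}

added, removed, renamed = diff_items(previous, current)
print("added:", added)
print("removed:", removed)
print("renamed:", renamed)
```

<p>Keying on the numeric id is what lets a rename show up as a change rather than as one removal plus one addition.</p>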
<p>This is the main tool I use for Minecraft Codex. It is currently bound to the
site itself, but I'm refactoring it to make it standalone and publish it on
GitHub.</p>

</div>

<hr />
</article>

<div class="block-info">
If you want to approach me directly about this post use the most appropriate channel
from <a href="/about/">the about page</a>.
</div>

</section>
<hr>
<footer>
Site created using <a target="_blank" href="https://getlektor.com">Lektor</a>. Source code available in <a target="_blank" href="https://github.com/fmartingr/fmartingr.com">Github</a>
</footer>
</body>
</html>